Abstract

This paper presents a technique for approximating, up to any precision, the set of subgame-perfect 
equilibria (SPE) in discounted repeated games. The process starts with a single hypercube approximation 
of the set of SPE. Then the initial hypercube is gradually partitioned into a set of smaller adjacent
hypercubes, while those hypercubes that cannot contain any point belonging to the set of SPE are 
simultaneously withdrawn. 

Whether a given hypercube can contain an equilibrium point is verified by an appropriate mathematical
program. Three different formulations of the algorithm for both approximately computing the set of
SPE payoffs and extracting players' strategies are then proposed: the first two do not assume the
presence of an external coordination between players, while the third assumes a certain level of
coordination during game play for convexifying the set of continuation payoffs after any repeated game
history.

Special attention is paid to the question of extracting players' strategies and their representability
in the form of finite automata, an important feature for artificial agent systems.



1 Introduction 



In multiagent systems (MAS) the notion of optimality cannot usually be applied to each agent separately. 
In a MAS, each agent's strategy (i.e., a plan specifying its behavior for every possible situation) can only 
be considered optimal if it maximizes that agent's utility function, subject to the constraints induced by the 
respective strategies of the other agents - members of the same MAS. When each agent's strategy is optimal 
in this interdependent sense, the combination of agents' strategies is called an equilibrium: as long as no 
agent can individually improve its utility, all agents prefer to keep their strategies constant. 

Given a MAS, a first problem consists of finding a compact yet sufficiently rich form of representing such
strategic interactions. Game theory provides a powerful framework for this. Repeated games (Fudenberg
and Tirole, 1991; Osborne and Rubinstein, 1999; Mailath and Samuelson, 2006) are an important
game-theoretic formalism that permits modeling and studying the long-term strategic interactions between
multiple selfish optimizers.

Probably the best-known example of a repeated game is the Prisoner's Dilemma, shown in Figure 1. In this
game, there are two players, and each of them can choose between two actions: C or D. When those players
simultaneously perform their actions, the pair of actions induces a numerical payoff obtained by each
player. The game then passes to the next stage, where it can be played again by the same pair of players.

Figure 1: The payoff matrix of Prisoner's Dilemma.

Game theory assumes that the goal of each player is to play optimally, i.e., to maximize its utility function 
given the strategies of the other players. When the a priori information about all players' strategies and 
their real strategic preferences coincide, we talk about equilibria. 

A pair of "Tit-For-Tat" (TFT) strategies is a well-known example of equilibrium in the Repeated Prisoner's
Dilemma. TFT consists of starting by playing C. Then, each player should play the same action as the most
recent action played by its opponent. Indeed, such a history-dependent equilibrium brings each player a
higher average payoff than that of another, stationary, equilibrium of the repeated game (a pair of strategies
that prescribe playing D at every stage). However, an algorithmic construction of such strategies, given an
arbitrary repeated game, is challenging. For the case where the utility function is given by the average payoff,
Littman and Stone (2005) propose a simple and efficient algorithm that constructs equilibrium strategies in
two-player repeated games. On the other hand, when the players discount their future payoffs with a discount
factor, a pair of TFT strategies constitutes an equilibrium only for certain values of the discount factor. Judd
et al. (2003) propose an approach for computing equilibria for different discount factors, but their approach
is limited to pure strategies and, as we will discuss below, has several other important limitations.

In this paper, we present an algorithmic approach to the problem of computing equilibria in repeated
games when the future payoffs are discounted. Our approach is more general than that of Littman and Stone
(2005), because it allows an arbitrary discounting, and is free of four major limitations of the algorithm of
Judd et al. (2003). Furthermore, our algorithm finds only those strategies that can be adopted by artificial
agents. The latter are usually characterized by a finite time to compute their strategies and a finite memory
to implement them. To the best of our knowledge, this is the first time that all these goals are achieved
simultaneously.

The remainder of this paper is structured as follows. In the next section, we present all necessary formal
notions and definitions, and we formally state the problem. In Section 3, we survey the previous work and
point out its limitations. Section 4 is the principal part of this paper. In this section, we describe
our algorithms for approximately solving repeated games with discounting and for extracting equilibrium
strategies. In Section 5, we investigate the theoretical properties of the proposed algorithms. Section 6
contains an overview of some experimental results. We conclude in Section 7 with a short discussion and
summary remarks.



2 Problem Statement 
2.1 Stage-Game

A stage-game is a tuple $(N, \{A_i\}_{i \in N}, \{r_i\}_{i \in N})$. In a stage-game, there is a finite set $N$, $|N| = n$, of
individual players that act (play, or make their moves in the game) simultaneously. Player $i \in N$ has a
finite set $A_i$ of pure actions (or, simply, actions) at its disposal. When each player $i$ among $N$ chooses a
certain action $a_i \in A_i$, the resulting vector $a = (a_1, \ldots, a_n)$ forms an action profile, which is then played, and
the corresponding stage-game outcome is realized. Each action profile belongs to the set of action profiles
$A = \times_{i \in N} A_i$. A player-specific payoff function $r_i$ specifies player $i$'s numerical reward for different game
outcomes. In a standard stage-game formulation, a bijection is typically assumed between the set of action
profiles and the set of game outcomes. In this case, a player's payoff function can be defined as the mapping
$r_i : A \mapsto \mathbb{R}$; also, this assumption permits using the notions of action profile and game outcome
interchangeably, with no ambiguity.

Given an action profile $a$, $r(a) = \times_{i \in N} r_i(a)$ is called a payoff profile. A mixed action $\alpha_i$ of player $i$ is a
probability distribution over its actions, i.e., $\alpha_i \in \Delta(A_i)$. A mixed action profile is a vector $\alpha = (\alpha_i)_{i \in N}$.
We denote by $\alpha_i^{a_i}$ and $\alpha^a$ respectively the probability that player $i$ plays action $a_i$ and the probability that
the outcome $a$ will be realized by $\alpha$, i.e., $\alpha^a = \prod_i \alpha_i^{a_i}$. The payoff function can be extended to mixed action
profiles by taking expectations.

The set of players' stage-game payoffs that can be generated by pure action profiles is denoted as

$$F = \{v \in \mathbb{R}^n : \exists a \in A \text{ s.t. } v = r(a)\}.$$

The set of feasible payoffs is the convex hull of the set $F$, i.e., $F^{\dagger} = \mathrm{co}\,F$.

Let $-i$ stand for "all players except $i$". An equilibrium (or a Nash equilibrium) in a stage-game is a mixed
action profile $\alpha$ with the property that for each player $i$ and for all $\alpha'_i \in \Delta(A_i)$, the following inequality
holds:

$$r_i(\alpha) \geq r_i(\alpha'_i, \alpha_{-i}),$$

where $\alpha = (\alpha_i, \alpha_{-i})$.
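The equilibrium condition can be made concrete for pure action profiles with a minimal Python sketch. The payoff values below are hypothetical Prisoner's Dilemma numbers chosen for illustration (they are not taken from Figure 1), and the names `ACTIONS`, `PAYOFFS`, `is_pure_equilibrium` and `pure_equilibria` are assumptions of this sketch. For a pure profile it is enough to test pure unilateral deviations, because the payoff of a mixed deviation is a convex combination of the payoffs of pure deviations.

```python
from itertools import product

# Hypothetical 2x2 Prisoner's Dilemma (illustrative values, not the paper's Figure 1).
# PAYOFFS[(a1, a2)] = (payoff of player 1, payoff of player 2).
ACTIONS = ("C", "D")
PAYOFFS = {
    ("C", "C"): (2, 2),
    ("C", "D"): (-1, 3),
    ("D", "C"): (3, -1),
    ("D", "D"): (0, 0),
}


def is_pure_equilibrium(profile, payoffs=PAYOFFS, actions=ACTIONS):
    """Return True if no player gains by a unilateral pure deviation."""
    for i in range(len(profile)):
        for deviation in actions:
            deviated = list(profile)
            deviated[i] = deviation
            if payoffs[tuple(deviated)][i] > payoffs[profile][i]:
                return False
    return True


def pure_equilibria(payoffs=PAYOFFS, actions=ACTIONS):
    """Enumerate all pure action profiles of a two-player game that are equilibria."""
    return [p for p in product(actions, repeat=2)
            if is_pure_equilibrium(p, payoffs, actions)]
```

With these illustrative payoffs, the only pure equilibrium of the stage-game is the profile in which both players play D.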

2.2 Repeated Game 

In a repeated game, the same stage-game is played in periods t = 0,1,2,..., also called stages. At the 
beginning of each stage, the players choose their actions that consequently form an action profile. Then they 
simultaneously play this action profile, and collect the stage-game payoffs corresponding to the resulting 
stage-game outcome. Then the repeated game passes to the next stage. When the number of game periods 
is not known in advance and can be infinite, the repeated game is called infinite. This is the scope of the 
present paper. 

The set of the repeated game histories up to period $t$ is given by $H^t = \times^t A$. The set of all possible
histories is given by $H = \bigcup_{t=0}^{\infty} H^t$. For instance, a history $h^t \in H^t$ is a stream of outcomes realized in the
repeated game starting from period 0 up to period $t - 1$:

$$h^t = (a^0, a^1, a^2, \ldots, a^{t-1}).$$

A pure strategy of player $i$ in the repeated game, $\sigma_i$, is a mapping from the set of all possible histories to
the set of player $i$'s actions, i.e., $\sigma_i : H \mapsto A_i$. A mixed strategy of player $i$ is a mapping $\sigma_i : H \mapsto \Delta(A_i)$.
$\Sigma_i$ denotes player $i$'s strategy space and $\Sigma = \times_{i \in N} \Sigma_i$ denotes the set of strategy profiles.

A subgame of an original repeated game is a repeated game based on the same stage-game as the original
repeated game but started from a given history $h^t$. Let a subgame be induced by a history $h^t$. The behavior
of players in that subgame after a history $h^\tau$ is identical to the behavior of players in the original repeated
game after the history $h^t \cdot h^\tau$, where $h^t \cdot h^\tau = (h^t, h^\tau)$ is a concatenation of two histories. Given a strategy
profile $\sigma \in \Sigma$ and a history $h \in H$, we denote the subgame strategy profile induced by $h$ as $\sigma|_h$.

An outcome path in the repeated game is a possibly infinite stream of action profiles $\vec{a} = (a^0, a^1, \ldots)$. A
finite prefix of length $t$ of an outcome path corresponds to a history in $H^{t+1}$. A strategy profile $\sigma$ induces
an outcome path $\vec{a}(\sigma) = (a^0(\sigma), a^1(\sigma), a^2(\sigma), \ldots)$ in the following way:

$$a^0(\sigma) \sim \sigma(\varnothing), \qquad a^1(\sigma) \sim \sigma(a^0(\sigma)), \qquad \ldots,$$

where the notation $a^t(\sigma) \sim \sigma(h^t)$ means that the outcome $a^t$ is realized at stage $t$ when the players were
playing according to the (mixed) action profile $\sigma(h^t)$. Obviously, in any two independent runs of the same
repeated game, the same pure strategy profile induces two identical outcome paths. In contrast, at each
period $t$, the action profile $a^t(\sigma)$ belonging to the outcome path induced by a mixed strategy profile $\sigma$ is a
realization of the random process $\sigma(h^t)$.

In order to compare two repeated game strategies in terms of the utility induced by each strategy, one
needs a criterion that permits comparing infinite payoff streams. Given an infinite sequence of payoff profiles
$v = (v^0, v^1, \ldots)$, the discounted average payoff $u_i^\gamma(v)$ of this sequence for player $i$ is given by

$$u_i^\gamma(v) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t v_i^t, \tag{1}$$

where $\gamma \in [0, 1)$ is the discount factor¹. One way to interpret the discount factor is to view it as a probability
that the repeated game will continue at the next stage (similarly, $(1 - \gamma)$ can be viewed as the probability that
the repeated game stops after the current stage). This interpretation is especially convenient for artificial
agents, because a machine has a non-zero probability of fault at any moment of time.

¹ In the notation $\gamma^t$, $t$ is the power of $\gamma$ and not a superscript.
Notice that in Equation (1), the sum of discounted payoffs is normalized by the factor $(1 - \gamma)$. This
ensures that $u_i^\gamma(v)$ stays within the range of player $i$'s stage-game payoffs for any instance of $v$ or $\gamma$. In
other words, after the normalization, the player's discounted average payoffs can be compared both with one
another and with the payoffs of the stage-game. Notice that because a sequence of payoff profiles, $v$, always
corresponds to an outcome path, $\vec{a}$, one can interchangeably and with no ambiguity write $u_i^\gamma(v)$ and
$u_i^\gamma(\vec{a})$ referring to the same quantity.
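The effect of the normalization can be checked numerically with a minimal Python sketch. It evaluates the discounted average payoff over a finite prefix of a payoff stream, which is an assumption of the sketch (the definition uses an infinite sum, so a truncated prefix only approximates it); the function name `discounted_average` is likewise illustrative.

```python
def discounted_average(stream, gamma):
    """Normalized discounted payoff (1 - gamma) * sum_t gamma**t * v_t of a finite prefix.

    `stream` holds one player's stage payoffs v_0, v_1, ...; truncating the
    infinite sum is an approximation made by this sketch.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("the discount factor must lie in [0, 1)")
    return (1.0 - gamma) * sum(gamma ** t * v for t, v in enumerate(stream))
```

For a long constant stream the result is numerically equal to the constant itself, which is what makes discounted average payoffs directly comparable with stage-game payoffs.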

To compare strategy profiles, a similar criterion can be defined. Let $\sigma$ be a pure strategy profile and $\gamma$
be a discount factor. Then the utility of the strategy profile $\sigma$ for player $i$ can be defined as

$$u_i^\gamma(\sigma) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t r_i(a^t(\sigma)). \tag{2}$$

As usual, when the players' strategies are mixed, one should take an expectation over the realized outcome
paths.

We define a utility profile induced by strategy profile $\sigma$ as $u^\gamma(\sigma) = (u_i^\gamma(\sigma))_{i \in N}$. As previously, due to the
normalization by the factor $(1 - \gamma)$, for any $\sigma \in \Sigma$ and for any $\gamma \in [0, 1)$, $u^\gamma(\sigma) \in F^{\dagger}$. Therefore, when
the meaning is clear from the context, we will use the terms "payoff" and "payoff profile" to refer to,
respectively, utility and utility profile.



2.3 Subgame-Perfect Equilibrium 

In order to act effectively in a given environment, any agent should have a strategy. When we talk about a
rational agent, this strategy has to be optimal in the sense that it should maximize that agent's expected
payoff with respect to the known properties of the environment. In a single agent case, it can often be
assumed that the properties of the environment do not change in response to the actions executed by the
agent. In this case, it is said that the environment is stationary (Sutton and Barto, 1998). In order to act
optimally in a stationary environment, the agent has to solve the following optimization problem:

$$\sigma_i = \max_{a_i \in A_i} \mathbb{E}_{a_j \sim \alpha_j} \left[ r_i(a_i, a_j) \right],$$

where $j$ denotes the environment as if it were a player repeatedly playing a mixed action $\alpha_j$.

When a rational agent plays a game with other rational agents, it has to optimize in the presence of 
the other optimizing players. This makes the problem non-trivial, since an optimal strategy for one player 
depends on the strategies chosen by the other players. In this context, if the opponents change their strategies, 
the player's strategy cannot generally retain optimality. 

The concept of equilibrium describes strategies in which all players' strategic choices simultaneously
optimize with respect to each other. The strategy profile $\sigma$ is an equilibrium (or a Nash equilibrium) if, for
each player $i$ and its strategies $\sigma'_i \in \Sigma_i$,

$$u_i^\gamma(\sigma) \geq u_i^\gamma(\sigma'_i, \sigma_{-i}),$$

where $\sigma = (\sigma_i, \sigma_{-i})$. In other words, in the equilibrium, no player can unilaterally change its strategy so as
to increase its own payoff.

Another notion is important when we consider strategies in repeated games. This is the notion of
sequential rationality or, if applied to the strategy profiles, of subgame-perfection. A strategy profile $\sigma$ is a
subgame-perfect equilibrium (SPE) in the repeated game, if for all histories $h \in H$, the subgame strategy
profile $\sigma|_h$ is an equilibrium in the subgame.

Let us first informally explain why, in repeated games, the notion of subgame-perfection is of such
high importance. Consider a grim trigger strategy. This strategy is similar to TFT in that the two players
start by playing C at the first period. Then grim trigger prescribes playing C until any player plays D, in
which case the strategy prescribes playing D forever. Let the game be as shown in Figure 2. Observe that
in this game, the reason why each player would prefer to play the cooperative action C while its opponent
plays C is that the profile of two grim trigger strategies is an equilibrium when $\gamma$ is close enough to 1. Indeed,
let Player 1 consider a possibility of deviation to the action D whenever Player 2 is supposed to play C.
Player 1 is informed that according to the strategy profile $\sigma$ (which is a profile of two grim trigger strategies),
starting from the next period, Player 2 will play D forever. Thus, when $\gamma$ is sufficiently close to 1,
after only one stage, at which the profile (D, D) is played following the deviation, Player 1 loses all the
additional gain it obtains owing to the deviation.

Figure 2: A game in which a profile of two grim trigger strategies is not a subgame-perfect equilibrium.

Now, let us suppose that Player 1 still decides to deviate after a certain history $h^t$. It plays D whenever
Player 2 plays C and collects the payoff of 3 instead of 2. The repeated game enters into the subgame induced
by the history $h^{t+1} = (h^t, (D, C))$. Now, according to the strategy profile $\sigma|_{h^{t+1}}$, Player 2 is supposed to
play D forever and "let the punishment happen". However, observe the payoffs of Player 2. If Player 2
plays D forever, as prescribed by the Nash equilibrium, it certainly obtains the average payoff of -2 in the
subgame, because the rational opponent (Player 1) will optimize with respect to this strategy. But if Player 2
continues playing C, it obtains the average payoff of -1 in the subgame, while its opponent, the deviator,
will continue enjoying the payoff of 3 at each subsequent period. As one can see, even if after the equilibrium
histories the profile of two grim trigger strategies constitutes an equilibrium in the game shown in Figure 2,
it is not an equilibrium in an out-of-equilibrium subgame. This simple example makes clear
why, in order to implement equilibria in practice, one needs to have recourse to subgame-perfect equilibria:
while one rational player should have no incentive to deviate being informed about the strategy prescribed to
the opponents (the property of Nash equilibrium), the rational opponents, in turn, need to have incentives to
follow their prescribed strategies after that player's eventual deviation (the property of subgame-perfection).



A subgame-perfect equilibrium always exists. To see this, observe first that according to Nash (1950a),
in any stage-game, there exists an equilibrium. It is then sufficient to notice that any strategy profile that
prescribes playing, after any history, a certain Nash equilibrium of the stage-game is a subgame-perfect
equilibrium.



2.4 Strategy Profile Automata 

By its definition, a player's strategy is a mapping from an infinite set of histories into the set of player's 
actions. In order to construct a strategy for an artificial agent (which is usually bounded in terms of memory 
and performance) one needs a way to specify strategies by means of finite representations. 

Intuitively, one can see that, given a strategy profile $\sigma$, two different histories $h^t$ and $h^\tau$ can induce
identical continuation strategy profiles, i.e., $\sigma|_{h^t} = \sigma|_{h^\tau}$. For example, in the case of the TFT strategy,
agents will have the same continuation strategy both after the history ((C, C), (C, C)) and after the
history ((D, C), (C, D), (C, C)). One can put all such histories into the same equivalence class. If one views
these equivalence classes of histories as players' states, then a strategy profile can be viewed as an automaton.

Let $M = (Q, q^0, f, \tau)$ be an automaton implementation of a strategy profile $\sigma$. It consists of a set of
states $Q$, with the initial state $q^0 \in Q$; of a profile of decision functions $f = \times_{i \in N} f_i$, where the decision
function of player $i$, $f_i : Q \mapsto \Delta(A_i)$, associates mixed actions with states; and of a transition function
$\tau : Q \times A \mapsto Q$, which identifies the next state of the automaton given the current state and the action
profile played in the current state.

Let $M$ be an automaton. In order to demonstrate how $M$ induces a strategy profile, one can first
recursively define $\tau(q, h^t)$, the transition function specifying the next state of the automaton given its initial
state $q$ and a history $h^t$ that starts in $q$, as

$$\tau(q, h^t) = \tau(\tau(q, h^{t-1}), a^{t-1}), \qquad \tau(q, h^1) = \tau(q, a^0).$$

With the above definition in hand, one can define $\sigma_i$, the strategy of player $i$ induced by the automaton $M$,
as

$$\sigma_i(\varnothing) = f_i(q^0), \qquad \sigma_i(h^t) = f_i(\tau(q^0, h^t)).$$

An example of a strategy profile implemented as an automaton is shown in Figure 3. This automaton
implements the profile of two grim trigger strategies. The circles are the states of the automaton. The
arrows are the transitions between the corresponding states; they are labeled with outcomes. The states are
labeled with the action profiles prescribed by the profiles of decision functions.



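[Figure 3 diagram: two states. The initial state, labeled (C, C), loops to itself on the outcome (C, C) and
moves to the second state on (C, D) ∨ (D, C) ∨ (D, D); the second state loops to itself on
(C, C) ∨ (C, D) ∨ (D, C) ∨ (D, D).]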



Figure 3: An example of an automaton implementing a profile of two grim trigger strategies. The circles are 
the states of the automaton; they are labeled with the action profiles prescribed by the profiles of decision 
functions. The arrows are the transitions between the corresponding states; they are labeled with outcomes. 
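The automaton of Figure 3 can be written down as a minimal Python sketch. The class name `ProfileAutomaton`, the state names `"cooperate"` and `"punish"`, and the restriction to pure decision functions are assumptions of this sketch (the definition above allows mixed actions in each state).

```python
class ProfileAutomaton:
    """A strategy-profile automaton with pure decision functions (sketch)."""

    def __init__(self, initial_state, decision, transition):
        self.initial_state = initial_state
        self.decision = decision      # maps a state to the prescribed action profile
        self.transition = transition  # maps (state, outcome) to the next state

    def state_after(self, history):
        """Fold the transition function over a history of outcomes."""
        state = self.initial_state
        for outcome in history:
            state = self.transition(state, outcome)
        return state

    def prescribed(self, history):
        """Action profile that the induced strategy profile plays after `history`."""
        return self.decision[self.state_after(history)]


def grim_transition(state, outcome):
    # Stay in the cooperative state only while both players cooperate;
    # the punishment state is absorbing.
    if state == "cooperate" and outcome == ("C", "C"):
        return "cooperate"
    return "punish"


GRIM = ProfileAutomaton(
    initial_state="cooperate",
    decision={"cooperate": ("C", "C"), "punish": ("D", "D")},
    transition=grim_transition,
)
```

After any history that contains a defection, `GRIM.prescribed` returns `("D", "D")`, regardless of what was played afterwards; two states therefore suffice to represent infinitely many histories.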



Since any automaton induces a strategy profile, any two automata can be compared in terms of the
utility they bring to the players. Let an automaton $M$ induce a strategy profile $\sigma$. The utility $u_i^\gamma(M)$ of the
automaton $M$ for player $i$ is then equal to $u_i^\gamma(\sigma)$, where $u_i^\gamma(\sigma)$ is given by Equation (2).

Let $|M|$ denote the number of states of automaton $M$. If the value $|M|$ is finite, such an automaton is called
a finite automaton; otherwise the automaton is called infinite. In MAS, most of the time, we are interested
in finite automata, because artificial agents always have a finite memory to store their strategies and a finite
processing power to construct them.

Any finite automaton induces a strategy profile; however, not every strategy profile can be represented using
finite automata. Kalai and Stanford (1988) demonstrated that any SPE can be approximated with a finite
automaton. First of all, they defined the notion of an approximate SPE. For an approximation factor $\epsilon > 0$,
a strategy profile $\sigma \in \Sigma$ is an $\epsilon$-equilibrium in a repeated game, if for each player $i$ and for all $\sigma'_i \in \Sigma_i$,
$u_i^\gamma(\sigma) \geq u_i^\gamma(\sigma'_i, \sigma_{-i}) - \epsilon$, where $\sigma = (\sigma_i, \sigma_{-i})$. A strategy profile $\sigma \in \Sigma$ is a subgame-perfect $\epsilon$-equilibrium
(SP$\epsilon$E) in the repeated game, if for all histories $h \in H$, the subgame strategy profile $\sigma|_h$ is an $\epsilon$-equilibrium
in the subgame induced by $h$. Kalai and Stanford (1988) then proved the following theorem:

Theorem 1 (Kalai and Stanford (1988)). Consider a repeated game with the discount factor $\gamma$ and the
approximation factor $\epsilon$. For any subgame-perfect equilibrium $\sigma$, there exists a finite automaton $M$ with the
property that $|u_i^\gamma(\sigma) - u_i^\gamma(M)| < \epsilon$ for all $i$, and such that $M$ induces a subgame-perfect $\epsilon$-equilibrium.



2.5 Problem Statement 

Let $U^\gamma \subset \mathbb{R}^n$ be the set of all SPE payoff profiles in a repeated game with the discount factor $\gamma$. Let $\Sigma^{\gamma,\epsilon} \subset \Sigma$
be the set of all SP$\epsilon$E strategy profiles in a repeated game with the discount factor $\gamma$ and the approximation
factor $\epsilon$.

In this paper, the problem of an approximate subgame-perfect equilibrium computation is stated as
follows: find a set $W \supseteq U^\gamma$ with the property that for any $v \in W$, one can find a finite automaton $M$
inducing a strategy profile $\sigma \in \Sigma^{\gamma,\epsilon}$, such that for all $i$, $v_i - u_i^\gamma(M) < \epsilon$.



3 Previous Work 



The work on equilibrium computation can be categorized into three main groups. For the algorithms of
the first group, the problem consists in computing one or several stationary equilibria (or $\epsilon$-equilibria) given
a payoff matrix. The discount factor is implicitly assumed to be equal to zero (Lemke and Howson, 1964;
McKelvey and McLennan, 1996; von Stengel, 2002; Chen et al., 2006; Porter et al., 2008). For example, in
the repeated Prisoner's Dilemma from Figure 1, the algorithms of the first group will only find the stationary
equilibrium,

$$\sigma_i(h) = D, \quad \forall i, \ \forall h,$$

whose payoff profile is (0, 0).
The algorithms belonging to the second group represent the other extreme. They assume the discount
factor to be arbitrarily close to 1. For instance, in two-player repeated games, this permits obtaining a
polynomial-time algorithm for constructing automata inducing equilibrium strategy profiles (Littman and
Stone, 2005). Indeed, when $\gamma$ tends to 1, the set of SPE payoff profiles $U^\gamma$ converges to the following set:

$$F^{\dagger*} = \{v \in F^{\dagger} : v_i \geq \underline{v}_i, \ \forall i\},$$

where the minmax payoff $\underline{v}_i$ of player $i$ is defined as

$$\underline{v}_i = \min_{\alpha_{-i}} \max_{a_i} r_i(a_i, \alpha_{-i}).$$

The set $F^{\dagger*}$ is called the set of feasible and individually rational payoff profiles. It is the smallest possible set
that can be guaranteed to entirely contain the set of all SPE payoff profiles in any repeated game. Having in
hand the set of SPE payoff profiles, in order to construct an SPE strategy profile, it remains to choose
any point $v \in F^{\dagger*}$ and to construct an automaton having a structure similar to TFT. More precisely, in
this automaton, there will be one "in-equilibrium" (or "cooperative") cycle that generates $v$ as an average
payoff profile, and two out-of-equilibrium (or "punishment") cycles, one for each player, where the deviator
obtains at most its minmax payoff during a finite number of periods (see Littman and Stone (2005) for more
details).

If the discount factor is viewed as the probability that the repeated game will be continued by the same set
of players, it usually cannot be arbitrarily modified (e.g., moved closer to 1). The third group of algorithms
for computing SPE payoffs and strategies aims at finding a solution by assuming that the discount factor $\gamma$
is a fixed given value between 0 and 1 (Cronshaw and Luenberger, 1994; Cronshaw, 1997; Judd et al., 2003).
These algorithms are based on the concept of self-generating sets, introduced by Abreu et al. (1990).



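Let us formally develop the idea of self-generation in application to the problem of computing the set of
pure SPE payoff profiles. Given a strategy profile $\sigma$, one can rewrite Equation (2) as follows:

$$
\begin{aligned}
u_i^\gamma(\sigma) &= (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t r_i(a^t(\sigma)) \\
&= (1 - \gamma) \left[ r_i(a^0(\sigma)) + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} r_i(a^t(\sigma)) \right] \\
&= (1 - \gamma) r_i(a^0(\sigma)) + \gamma u_i^\gamma(\sigma|_{a^0(\sigma)}).
\end{aligned}
$$

Let $u_i^\gamma(a_i, \sigma|_{h^t})$ denote player $i$'s utility for playing action $a_i$ at history $h^t$ given the strategy profile $\sigma$. Let
$a = (a_i, a_{-i})$ be the action profile prescribed by strategy profile $\sigma$ at history $h^t$, i.e., $a = \sigma(h^t) \equiv \sigma|_{h^t}(\varnothing)$.
For all $a'_i \in A_i$ one can write,

$$u_i^\gamma(a'_i, \sigma|_{h^t}) = (1 - \gamma) r_i(a'_i, a_{-i}) + \gamma u_i^\gamma(\sigma|_{h^{t+1}}), \tag{3}$$

where $h^{t+1} \equiv h^t \cdot a'$ is obtained as a concatenation of the history $h^t$ and the action profile $a' = (a'_i, a_{-i})$; and
$u_i^\gamma(\sigma|_{h^{t+1}})$ represents the so-called continuation promise of the strategy $\sigma$ after the history $(h^t \cdot a')$.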

Therefore, at each period of the repeated game, player $i$ has a choice between different actions $a'_i \in A_i$,
each having a particular utility $v_i$. Consequently, each period of the repeated game can be represented as
a certain stage-game, whose payoffs are equal to the original stage-game payoffs augmented by the
corresponding continuation promises. Let us call such a new stage-game an augmented game. For instance, let the
stage-game of the repeated game be as shown in Figure 4. Given a strategy profile $\sigma$ and a history $h^t$, the
augmented game corresponding to this stage-game is shown in Figure 5.

             C           D
    C     r(C,C)      r(C,D)
    D     r(D,C)      r(D,D)

Figure 4: A generic stage-game.

Figure 5: An augmented game for the generic stage-game from Figure 4.



By reformulating the definition of subgame-perfect equilibrium, the strategy profile $\sigma$ is an SPE if and
only if it induces an equilibrium mixed action profile in the augmented game after any history.
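The construction of an augmented game can be illustrated with a minimal Python sketch. It builds the augmented payoffs $(1 - \gamma) r(a) + \gamma w(a)$ from a stage-game and a table of continuation promises, and then checks whether a pure profile is an equilibrium of the augmented game. The stage payoffs are hypothetical Prisoner's Dilemma values, the promises correspond to a grim-trigger-like profile in that hypothetical game, and all names (`STAGE`, `PROMISES`, `augmented_game`, `is_pure_equilibrium`) are assumptions of this sketch. Only one history (the cooperative one) is examined, so the check is necessary but not sufficient for subgame-perfection.

```python
ACTIONS = ("C", "D")

# Hypothetical stage-game payoffs (illustrative values, not the paper's figures).
STAGE = {
    ("C", "C"): (2, 2),
    ("C", "D"): (-1, 3),
    ("D", "C"): (3, -1),
    ("D", "D"): (0, 0),
}

# Continuation promises after each outcome: cooperation continues after (C, C),
# any other outcome is followed by permanent mutual defection.
PROMISES = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 0),
    ("D", "C"): (0, 0),
    ("D", "D"): (0, 0),
}


def augmented_game(stage, promises, gamma):
    """Payoffs (1 - gamma) * r(a) + gamma * w(a) for every action profile a."""
    return {
        a: tuple((1 - gamma) * r + gamma * w for r, w in zip(stage[a], promises[a]))
        for a in stage
    }


def is_pure_equilibrium(game, profile, actions=ACTIONS):
    """True if no player gains in `game` by a unilateral pure deviation."""
    for i in range(len(profile)):
        for d in actions:
            deviated = profile[:i] + (d,) + profile[i + 1:]
            if game[deviated][i] > game[profile][i]:
                return False
    return True
```

With these illustrative numbers, (C, C) is an equilibrium of the augmented game for $\gamma = 0.5$ but not for $\gamma = 0.2$, since a deviation is unprofitable only when $3(1 - \gamma) \leq 2$.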

Let $V^\gamma$ denote the set of pure action subgame-perfect equilibrium payoff profiles one wants to identify.
Recall Equation (3): at a history $h^t$, in order to make part of a subgame-perfect equilibrium strategy,
action $a'_i$ has to be "supported" by a certain continuation promise $u_i^\gamma(\sigma|_{h^{t+1}})$, where $h^{t+1} = h^t \cdot (a'_i, a_{-i})$ and
$(a_i, a_{-i}) = \sigma(h^t)$. By the property of subgame-perfection, this must hold after any history. Therefore, if
$a'_i$ does make part of a certain subgame-perfect equilibrium $\sigma$ at the history $h^t$, then $u_i^\gamma(\sigma|_{h^{t+1}})$ has to belong
to $V^\gamma$, as well as $u_i^\gamma(a'_i, \sigma|_{h^t})$. This self-referential property of subgame-perfect equilibrium suggests a way
by which one can find $V^\gamma$.

Let $BR_i(\alpha)$ denote a stationary best response of player $i$ to the mixed action profile $\alpha = (\alpha_i, \alpha_{-i})$, i.e.,

$$BR_i(\alpha) = \arg\max_{a_i \in A_i} r_i(a_i, \alpha_{-i}).$$

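The analysis focuses on the map $B^\gamma$ defined on a set $W \subset \mathbb{R}^n$:

$$B^\gamma(W) = \bigcup_{(a, w) \in A \times W} (1 - \gamma) r(a) + \gamma w,$$

where $w$ has to verify for all $i$:

$$(1 - \gamma) r_i(a) + \gamma w_i - (1 - \gamma) r_i(BR_i(a), a_{-i}) - \gamma \underline{w}_i \geq 0,$$

and $\underline{w}_i \equiv \inf_{w \in W} w_i$. Abreu et al. (1990) show that the largest fixed point of $B^\gamma(W)$ is $V^\gamma$.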
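One application of the map can be sketched in Python for the special case of a finite set $W$ of candidate payoff profiles and pure action profiles. Treating $W$ as a finite set is a simplification made by this sketch (in general $W$ is a continuum), the stage-game payoffs are hypothetical Prisoner's Dilemma values, and the names `best_deviation_payoff` and `apply_b` are illustrative.

```python
def best_deviation_payoff(stage, actions, profile, i):
    """Largest stage payoff player i can obtain by changing only its own action."""
    return max(stage[profile[:i] + (d,) + profile[i + 1:]][i] for d in actions[i])


def apply_b(w_set, stage, actions, gamma, tol=1e-12):
    """One application of the map to a finite set of payoff profiles (sketch).

    A pair (a, w) contributes the point (1 - gamma) * r(a) + gamma * w if, for
    every player, following a and receiving w is at least as good as the best
    one-shot deviation followed by that player's worst promise in w_set.
    """
    n = len(actions)
    w_low = [min(w[i] for w in w_set) for i in range(n)]
    generated = set()
    for a in stage:
        for w in w_set:
            supported = all(
                (1 - gamma) * stage[a][i] + gamma * w[i]
                - (1 - gamma) * best_deviation_payoff(stage, actions, a, i)
                - gamma * w_low[i] >= -tol
                for i in range(n)
            )
            if supported:
                generated.add(
                    tuple((1 - gamma) * stage[a][i] + gamma * w[i] for i in range(n))
                )
    return generated


# Hypothetical Prisoner's Dilemma (illustrative values, not the paper's figures).
ACTIONS = (("C", "D"), ("C", "D"))
STAGE = {
    ("C", "C"): (2, 2),
    ("C", "D"): (-1, 3),
    ("D", "C"): (3, -1),
    ("D", "D"): (0, 0),
}
```

For the candidate set {(0, 0), (2, 2)} and $\gamma = 0.5$, both candidates reappear in the image of the map, so the set generates itself; for $\gamma = 0.2$ the cooperative point (2, 2) is no longer generated.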

Any numerical implementation of $B^\gamma(W)$ requires an efficient representation of the set $W$ in a machine.
Judd et al. (2003) use convex sets in order to approximate both $W$ and $B^\gamma(W)$ as an intersection of a finite
number of half-spaces. Each application of $B^\gamma(W)$ is then reduced to solving a simple linear program. The
algorithm starts with a set $W \subset \mathbb{R}^n$ that is guaranteed to entirely contain $V^\gamma$. Then it iteratively modifies
$W$ as $W \leftarrow B^\gamma(W)$ until convergence. We omit further details: the interested reader can refer to Judd et al.
(2003).



The approach of Judd et al. (2003) has, however, several important limitations:

1. It assumes the existence of at least one pure action equilibrium in the stage-game;

2. It permits computing only pure action SPE strategy profiles;

3. It cannot find SPE strategy profiles implementable by finite automata with given precision;

4. It is naturally applicable only if the set of SPE payoff profiles is convex. In practice, this is
often not the case. This means that in order to be capable of adopting strategies computed by the
algorithm, the players need to have a way to convexify the set of continuation promises by randomizing
between them. This can be done, for example, by means of a special communication protocol (e.g.,
jointly controlled lotteries by Aumann et al. (1995)) or by using a public correlating device (Mailath
and Samuelson, 2006).

In the next section, we present three different formulations of our algorithm for solving the problem of an
approximate SPE computation, as it was stated in Section 2. The first formulation is only free of the last two
limitations of the approach of Judd et al. (2003). The second formulation is free of all four limitations, but
it is not guaranteed to find a set containing all mixed strategy SPE payoff profiles. The third version of the
algorithm, in turn, finds a set containing all (pure and mixed) SPE payoff profiles. However, it accomplishes
this at the expense of convexifying the set of continuation promises, i.e., it has the fourth limitation.
4 The Algorithms



The fixed point property of the map $B^\gamma$ and its relation to the set of SPE payoff profiles can be used
to approximate the latter. Indeed, according to Abreu et al. (1990), if (i) for a certain set $W$ we have
$B^\gamma(W) = W$ and (ii) $W$ is the largest such set, then $W = U^\gamma$. The idea is to start with a certain set $W$ that
is guaranteed to entirely contain $U^\gamma$, and then to iteratively eliminate all those points $w' \in W$ for which
$\nexists (w, \alpha) \in W \times \Delta(A_1) \times \ldots \times \Delta(A_n)$, such that,

$$
\begin{aligned}
&(1)\ w' = (1 - \gamma) r(\alpha) + \gamma w, \text{ and,} \\
&(2)\ (1 - \gamma) r_i(\alpha) + \gamma w_i - (1 - \gamma) r_i(BR_i(\alpha), \alpha_{-i}) - \gamma \underline{w}_i \geq 0, \ \forall i.
\end{aligned} \tag{4}
$$

Algorithm 1 outlines the basic structure for three different formulations that will be defined in the
following subsections. The algorithm starts with an initial approximation $W$ of the set of SPE payoff
profiles $U^\gamma$. The set $W$ is represented by a union of disjoint hypercubes belonging to the set $C$. Each
hypercube $c \in C$ is identified by its origin $o^c \in \mathbb{R}^n$ and by the side length $l$, the same for all hypercubes.
Initially, $C$ contains only one hypercube $c$, whose origin $o^c$ is set to be a vector $(\underline{r})_{i \in N}$; the side length $l$ is
set to be $l = \bar{r} - \underline{r}$, where $\underline{r} \equiv \min_{a,i} r_i(a)$ and $\bar{r} \equiv \max_{a,i} r_i(a)$. Thus, $W$ entirely contains $U^\gamma$.

Input: $r$, a payoff matrix; $\gamma$, a discount factor; $\epsilon$, an approximation factor.
 1: Let $l = \bar{r} - \underline{r}$ and $o^c = (\underline{r})_{i \in N}$;
 2: Set $C \leftarrow \{(o^c, l)\}$;
 3: loop
 4:   Set AllCubesCompleted $\leftarrow$ True;
 5:   Set NoCubeWithdrawn $\leftarrow$ True;
 6:   for each $c \equiv (o^c, l) \in C$ do
 7:     Let $\underline{w}_i = \min_{c \in C} o^c_i$;
 8:     Set $\underline{w} \leftarrow (\underline{w}_i)_{i \in N}$;
 9:     if CubeSupported($c$, $C$, $\underline{w}$) is False then
10:       Set $C \leftarrow C \setminus \{c\}$;
11:       if $C = \varnothing$ then
12:         return False;
13:       Set NoCubeWithdrawn $\leftarrow$ False;
14:     else
15:       if CubeCompleted($c$) is False then
16:         Set AllCubesCompleted $\leftarrow$ False;
17:   if NoCubeWithdrawn is True then
18:     if AllCubesCompleted is False then
19:       Set $C \leftarrow$ SplitCubes($C$);
20:     else
21:       return $C$.

Algorithm 1: The basic structure for all proposed algorithms. 



Each iteration of Algorithm 1 consists of verifying, for each hypercube $c \in C$, whether it has to be
eliminated from the set $C$ (procedure CubeSupported). If $c$ does not contain any point $w'$ satisfying the
conditions of Equation (4), this hypercube is withdrawn from the set $C$. If, by the end of a certain iteration,
no hypercube was withdrawn, each remaining hypercube is split into $2^n$ disjoint hypercubes with side $l/2$
(procedure SplitCubes). The process continues until, for each remaining hypercube, a certain stopping
criterion is satisfied (procedure CubeCompleted).
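The splitting step can be illustrated with a short sketch. It assumes a hypercube is stored as a pair (origin tuple, side length), which mirrors the description above but is only one possible representation; the function names `split_cube` and `split_cubes` are hypothetical and do not come from the paper.

```python
from itertools import product


def split_cube(origin, side):
    """Split one hypercube into 2^n disjoint hypercubes of side length side/2."""
    half = side / 2.0
    children = []
    # Each child origin shifts every coordinate by either 0 or half the side.
    for offsets in product((0.0, half), repeat=len(origin)):
        child_origin = tuple(o + d for o, d in zip(origin, offsets))
        children.append((child_origin, half))
    return children


def split_cubes(cubes):
    """Illustrative SplitCubes: apply split_cube to every hypercube in the set."""
    return [child for origin, side in cubes for child in split_cube(origin, side)]
```

For a two-player game, splitting the single hypercube with origin (0, 0) and side 2 yields four hypercubes of side 1 whose union is the original one.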



4.1 Pure Strategy Equilibria 

For the case where the goal is to only approximate pure action equilibria, the definition of the procedure
CubeSupported is given in Algorithm 2. In this algorithm, the set of hyperrectangular clusters, $S$, is
obtained from the set of hypercubes, $C$, by finding a smaller set, such that the union of its elements is equal
to the union of the elements of $C$ (procedure GetClusters). Each $s \in S$ is identified by its origin $o^s \in \mathbb{R}^n$






Input: $c \equiv (o^c, l)$, a hypercube; $C$, a set of hypercubes; $\underline{w}$, a vector of payoffs.
1: $S \leftarrow$ GetClusters($C$)
2: for each $s \equiv (o^s, l^s) \in S$ do
3:     for each $a \in A$ do
4:         Solve the following linear constraint satisfaction problem:
           Decision variables: $w \in \mathbb{R}^n$ and $w' \in \mathbb{R}^n$;
           Subject to constraints:
           (1) $w' = (1-\gamma)\, r(a) + \gamma w$;
           For all $i$:
           (2) $(1-\gamma)\, r_i(a) + \gamma w_i - (1-\gamma)\, r_i(BR_i(a), a_{-i}) - \gamma \underline{w}_i \ge 0$,
           (3) $o_i^s \le w_i \le o_i^s + l_i^s$,
           (4) $o_i^c \le w'_i \le o_i^c + l$;
5:         if a pair $(w, w')$ satisfying the constraints is found then
6:             return $(a, w)$;
7: return False.

Algorithm 2: CubeSupported for pure strategies. The procedure verifies whether a given hypercube $c$
has to be kept in the set of hypercubes $C$. If yes, CubeSupported returns a pure action profile and a
continuation promise. Otherwise the procedure returns False.



and by the vector of side lengths $l^s \in \mathbb{R}^n$. The clusterization of the set $C$ is done to speed up the algorithm
in practice. In our experiments, we used a simple greedy algorithm to identify hyperrectangles. Of course,
one can always define $S \equiv C$. In this case, for each $c \equiv (o^c, l) \in C$, there will be exactly one $s \in S$, such
that $s = (o^c, (l)_{i \in N})$.

The linear constraint satisfaction program of Algorithm 2 can be solved by any linear program solver.
We used CPLEX (IBM, Corp., 2009) together with OptimJ (ATEJI, 2009) for solving all mathematical
programs defined in this paper.

If the conditions of Equation (4) are verified, and $c$ has to be kept in $C$, the CubeSupported procedure
of Algorithm 2 returns a pure action profile $a$ and a payoff profile $w$ such that $(1-\gamma)\, r(a) + \gamma w \equiv w'$ belongs
to the hypercube $c$. Otherwise, the procedure returns False.
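As an illustration of the pure strategy check, note that for a fixed pure action profile $a$ the constraints of Algorithm 2 decouple across players: each constraint restricts a single coordinate $w_i$ to an interval. The sketch below exploits this and replaces the LP solver by an interval intersection; it is a simplification made for this example, not the CPLEX-based implementation used in the paper. The function name, the dictionary representation of the payoff matrix, and the assumption $\gamma > 0$ are all illustrative choices.

```python
def cube_supported_pure(r, gamma, cube, clusters, w_low):
    """Illustrative CubeSupported for pure strategies (assumes gamma > 0).

    r        : dict mapping a pure action profile (tuple) to a payoff tuple
    cube     : (origin, side) of the hypercube c
    clusters : list of (origin, side_lengths) hyperrectangles
    w_low    : punishment payoffs, one per player
    """
    o_c, l = cube
    n = len(o_c)
    actions = [sorted({a[i] for a in r}) for i in range(n)]
    for o_s, l_s in clusters:
        for a, payoff in r.items():
            w = []
            for i in range(n):
                # Stage payoff of player i's best response to a_{-i}.
                br = max(r[a[:i] + (ai,) + a[i + 1:]][i] for ai in actions[i])
                lo = max(
                    o_s[i],                                           # constraint (3)
                    (o_c[i] - (1 - gamma) * payoff[i]) / gamma,       # constraint (4)
                    w_low[i] + (1 - gamma) * (br - payoff[i]) / gamma,  # constraint (2)
                )
                hi = min(
                    o_s[i] + l_s[i],                                  # constraint (3)
                    (o_c[i] + l - (1 - gamma) * payoff[i]) / gamma,   # constraint (4)
                )
                if lo > hi:
                    break  # no feasible continuation payoff for player i
                w.append(lo)
            else:
                return a, tuple(w)
    return False
```

Any point of the feasible interval could be returned as the continuation promise; the sketch picks the lower end.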



4.2 Mixed Strategy Equilibria 

Computing the set of all equilibria (i.e., pure action and mixed action, stationary and non stationary) is a 
more challenging task. To the best of our knowledge, there is no algorithm capable of at least approximately 
solving this problem. The previous pure strategy case was greatly simplified by two circumstances: 

1. It is possible to enumerate pure action profiles one by one in order to test all possibilities to satisfy the
two conditions of Equation (4).

2. Any deviation of player $i$ from the recommended (by the equilibrium strategy profile) action profile $a \equiv
(a_i, a_{-i})$ in a case, where $a_i \notin \{a'_i : r_i(a'_i, a_{-i}) = r_i(BR_i(a), a_{-i})\}$, is immediately detected by the other
players. This makes it possible to enforce the condition (2) of Equation (4) in practice.

When the action profiles, which strategy profiles can recommend, are allowed to be mixed, their one
by one enumeration is impossible. Furthermore, deviations from mixed actions can only be detected if the
deviation is done in favor of an out-of-the-support action. In game theory, the support of a mixed action $\alpha_i$
is a set $A_i^{\alpha_i} \subseteq A_i$, which contains all pure actions to which $\alpha_i$ assigns a non-zero probability. Therefore,
the other players can immediately detect a deviation only if player $i$ plays an action $a_i \notin A_i^{\alpha_i}$ when it is
supposed to play the mixed action $\alpha_i$. Deviations that only involve the actions in the support
of $\alpha_i$ cannot be detected.

We solve the two aforementioned problems in the following way. We first define a special mixed integer
program (MIP). We then let the solver decide which actions to include in the mixed action support
of each player, and what probability has to be assigned to those actions. The MIP has to be solved for all
agents simultaneously. Because the goal is to only satisfy the constraints of Equation (4), the presence of a
particular objective function in the MIP is not generally necessary. In our implementation, we have chosen
an objective to minimize the sum of the cardinalities of the supports. This means that, when possible, the
preference is given to pure action strategies.

Player $i$ is only willing to randomize according to a mixture $\alpha_i$, if it is indifferent over the pure actions in
the support of the mixture. The technique is to specify different continuation promises for different actions
in the support of the mixture, such that the utility of each action remains bounded by the dimensions of the
hypercube. Algorithm 3 defines the procedure CubeSupported for the case, where the set of continuation
promises is represented by the union of hyperrectangular clusters.

For each hyperrectangular cluster, $s$, containing possible continuations, Algorithm 3 verifies whether, for
the given hypercube $c$, one can find a mixed action profile $\alpha \equiv (\alpha_i, \alpha_{-i})$, such that for all $i$ and for all
$a_i \in A_i^{\alpha_i}$, $(1-\gamma)\, r_i(a_i, \alpha_{-i}) + \gamma w_i(a_i) \equiv w'_i(a_i)$ lies between $o_i^c$ and $o_i^c + l$. This will satisfy the first condition of
Equation (4). To satisfy the second condition, the choice of the mixed action and of the continuation payoffs
has to be such that, for all $i \in N$ and for all $a_i \notin A_i^{\alpha_i}$, $\underline{w}_i \le w_i(a_i) \le \underline{w}_i + l$, to make any out-of-the-support
deviation approximately unprofitable.

Observe that in the MIP of Algorithm 3, we assign the continuation payoffs $w_i(a_i)$ to pure actions and
not to pure action profiles (as, for example, can follow from the definition of an augmented game). This
permits avoiding the non-linear term $\alpha_{-i}^{a_{-i}} w_i(a_i, a_{-i})$ in the constraint (3) of the MIP. One can do this,
without missing any continuation promise belonging to the cluster, thanks to the rectangular structure of
the latter: for any action profile $a = (a_1, \ldots, a_n)$ realized during the repeated game play, the corresponding
continuation payoff profile $(w_1(a_1), \ldots, w_n(a_n))$ will always be found inside, or on the boundary of, a certain
hyperrectangle. This assures that the continuation payoff profiles belong to $W$.

In Algorithm 3, the required indifference of player $i$ between the actions in the support of the mixed
action is (approximately) secured by the constraint (4) of the MIP. Observe that in an optimal solution of
the MIP, the binary variables $y_i^{a_i}$, known as indicator variables, can only be equal to 1 if $a_i$ is in the support
of $\alpha_i$. Therefore, according to the constraint (4), each $w'_i(a_i)$ is either bounded by the dimensions of the
hypercube, if $a_i \in A_i^{\alpha_i}$, or is below the origin of the hypercube, otherwise.

Notice that the MIP of Algorithm 3 is only linear in the case of two players. For more than two players,
the problem becomes non-linear due to the fact that $\alpha_{-i}$ is now given by a product of decision variables $\alpha_j$,
for all $j \in N \setminus \{i\}$. For three players, for example, such an optimization problem becomes a mixed integer
quadratically constrained program (MIQCP); such optimization problems are generally very difficult as they
combine two kinds of non-convexities: integer variables and non-convex quadratic constraints (Saxena et al.,
2008).

The fact that all continuation payoff profiles are contained within one cluster makes the optimization
problem easier to solve; however, such an approach also restricts the set of equilibria by allowing only those
SPE, for which the continuation payoffs are always contained within a certain cluster. Nevertheless, the
solutions that can be computed by Algorithm 3 include, among others, all pure strategy SPE (because,
in this case, the continuation payoff profile is a unique point belonging to a certain cluster) as well as all
stationary mixed strategy SPE (because for any $i$ and any stationary SPE payoff $w'_i(a_i)$, belonging to a certain
hypercube, the continuation payoff $w_i(a_i)$ belongs to the same hypercube). A more general formulation of the
MIP could, for example, allow the continuations for different action profiles to belong to different clusters.
The task of selecting a particular cluster for $w_i(a)$, for all $i$, can also be left to the solver. This would,
however, again result in a non-linear MIP, because, now, the constraint (3) would look as follows,

$$w'_i(a_i) = \sum_{a_{-i}} \alpha_{-i}^{a_{-i}} \left( (1-\gamma)\, r_i(a_i, a_{-i}) + \gamma w_i(a_i, a_{-i}) \right),$$

where $w_i(a_i, a_{-i})$, the continuation payoff assigned to an action profile $a \equiv (a_i, a_{-i})$, is bounded by the
dimensions of a certain cluster.

There is a way to modify Algorithm 3 so as to keep in $W$ all SPE payoff profiles while preserving the
linearity of the MIP, at least for two-player repeated games. This can be achieved by assuming a certain
level of coordination between players during the game play. This is the subject of the next subsection.
Input: $c \equiv (o^c, l)$, a hypercube; $C$, a set of hypercubes; $\underline{w}$, a vector of payoffs.
1: $S \leftarrow$ GetClusters($C$)
2: for each $s \equiv (o^s, l^s) \in S$ do
3:     Solve the following mixed integer program:
       Decision variables: $w_i(a_i) \in \mathbb{R}$, $w'_i(a_i) \in \mathbb{R}$, $y_i^{a_i} \in \{0, 1\}$, $\alpha_i^{a_i} \in [0, 1]$ for all $i \in \{1, 2\}$ and for all
       $a_i \in A_i$;
       Objective function: $\min f \equiv \sum_i \sum_{a_i} y_i^{a_i}$;
       Subject to constraints:
       For all $i$:
       (1) $\sum_{a_i} \alpha_i^{a_i} = 1$;
       For all $i$ and for all $a_i \in A_i$:
       (2) $\alpha_i^{a_i} \le y_i^{a_i}$,
       (3) $w'_i(a_i) = (1-\gamma) \sum_{a_{-i}} \alpha_{-i}^{a_{-i}} r_i(a_i, a_{-i}) + \gamma w_i(a_i)$,
       (4) $o_i^c\, y_i^{a_i} \le w'_i(a_i) \le l\, y_i^{a_i} + o_i^c$,
       (5) $\underline{w}_i - \underline{w}_i y_i^{a_i} + o_i^s y_i^{a_i} \le w_i(a_i) \le (\underline{w}_i + l) - (\underline{w}_i + l)\, y_i^{a_i} + (o_i^s + l_i^s)\, y_i^{a_i}$;
4:     if a solution is found then
5:         return $w_i(a_i)$ and $\alpha_i^{a_i}$ for all $i \in \{1, 2\}$ and for all $a_i \in A_i$;
6: return False.

Algorithm 3: CubeSupported for mixed strategies. The procedure verifies whether a given hypercube $c$
has to be kept in the set of hypercubes $C$. If yes, CubeSupported returns a mixed action profile $\alpha$ and
the corresponding continuation promise payoffs for each pure action in the support of $\alpha_i$. Otherwise the
procedure returns False.

4.3 Public Correlation 

A mixed SPE strategy profile $\sigma$, after each history $h$, suggests to the players a certain mixed action profile $\alpha$
and has a certain value $w(\alpha)$ associated with it. More precisely, $w(\alpha)$ is an expected continuation promise
for playing mixed action $\alpha$ at history $h$, such that $u^\gamma(\sigma|_h) = w' = (1-\gamma)\, r(\alpha) + \gamma w(\alpha)$. Also, $\sigma|_h$ induces
a certain continuation payoff profile $w(a)$ for each outcome $a$ realized at $h$. Because $\sigma$ is an SPE, every
such $w(a)$ belongs to $U^\gamma$, the set of SPE payoff profiles. However, $w(\alpha)$ does not necessarily belong to $U^\gamma$,
because $w(\alpha)$ is a mixture $\sum_{a \in A} \alpha^a w(a)$. On the other hand, for any $\alpha$, $w(\alpha)$ does belong to $\mathrm{co}\,U^\gamma$, the
convex hull of the set of SPE payoff profiles.

Let us assume that one can select, as a continuation payoff profile, any payoff profile from $\mathrm{co}\,U^\gamma$. In the
MIP of the CubeSupported procedure, one can, therefore, associate continuation payoffs $w_i$ with player $i$'s
actions, and not with action profiles. This would permit avoiding the previously seen non-linearity when
we allowed the continuations to belong to different hypercubes. To achieve this, one can rewrite $w(\alpha)$ as
$(w_i(\alpha))_{i \in N}$, where $w_i(\alpha) = \sum_{a_i} \alpha_i^{a_i} w_i(a_i|\alpha)$ and $w_i(a_i|\alpha) = \sum_{a' \in A : a'_i = a_i} w_i(a') \prod_{j \in N \setminus \{i\}} \alpha_j^{a'_j}$. Let $w' =
(w'_i)_{i \in N}$ be an SPE payoff profile and let one want to identify $\alpha$ and $w(\alpha)$ in support of $w'$. If $a_i \in A_i^{\alpha_i}$
then, for all $i$,

$$w'_i = (1-\gamma)\, r_i(a_i|\alpha) + \gamma w_i(a_i|\alpha),$$

where $r_i(a_i|\alpha) = \sum_{a' \in A : a'_i = a_i} r_i(a') \prod_{j \in N \setminus \{i\}} \alpha_j^{a'_j}$. For two players, the right-hand expression for $w'_i$ is linear.
Furthermore, for any choice of $\alpha$, the payoff profile obtained as $(w_i(a_i|\alpha))_{i \in N}$ is a point in $\mathbb{R}^n$ that belongs
to $\mathrm{co}\,W$. One can now modify the optimization problem of Algorithm 3 so as to keep inside the convex hull
of $W$ any point $(w_i(a_i))_{i \in N}$, such that $a_i \in A_i^{\alpha_i}$, $\forall i$. In doing so, we are guaranteed to keep in $W$ all possible
SPE payoff profiles.

A convexification of the set of continuation payoff profiles can be done in different ways, one of which
is public correlation. A repeated game with public correlation is a repeated game, such that in every stage-game,
a realization $\omega \in (0, 1]$ of a public random variable is first drawn, which is observed by all players,
and then each player chooses an action. The public signal $\omega$ can be generated by a certain public correlating
device (Mailath and Samuelson, 2006). This device has to be capable of generating instances of a
given random variable and to be unbiased, i.e., indifferent with regard to the repeated game outcomes. A
public correlating device can be simulated by a special communication protocol, such as a jointly controlled
lottery (Aumann et al., 1995).

Let $\sigma$ be an SPE strategy profile that suggests playing a mixed action profile $\alpha$ at $h^t$ and promises a
continuation payoff profile $w(a)$ for each $a \in A$. If, for all $a$, $w(a) \in U^\gamma$, no public correlation is necessary:
for any possible $h^{t+1} = h^t \cdot a$, there exists $\sigma|_{h^{t+1}} \in \Sigma^\gamma$, such that $u^\gamma(\sigma|_{h^{t+1}}) = w(a)$. Now, let us suppose that
after playing a mixed action $\alpha$ at $h^t$, an outcome $a$ has been realized, such that $w(a) \in \mathrm{co}\,U^\gamma \setminus U^\gamma$. In this
case, one cannot find any strategy $\sigma|_{h^{t+1}} \in \Sigma^\gamma$, such that $u^\gamma(\sigma|_{h^{t+1}}) = w(a)$. On the other hand, by using
a correlating device during the game play, the players can obtain, in expectation, the continuation payoff
profile $w(a)$ as a convex combination of $K$ points $w_k \in U^\gamma$. I.e., there exist $K$ non-negative real numbers $p_k$
with the property that $\sum_{k=1}^{K} p_k w_k = w(a)$ and $\sum_{k=1}^{K} p_k = 1$.

Let a public correlating device be available and capable of generating a uniformly distributed signal
$\omega \in (0, 1]$ when needed. Define $p_0 = 0$. If $\omega \in \left(\sum_{j=0}^{k-1} p_j, \sum_{j=0}^{k} p_j\right]$ for some $k = 1, \ldots, K$, then $\sigma|_{h^{t+1}}$ is set
to be a $\sigma \in \Sigma^\gamma$ such that $u^\gamma(\sigma) = w_k$. By so doing, any SPE payoff profile $v$, computed assuming that the
set of continuation payoff profiles is convex, can in practice be induced by a certain SPE strategy profile $\sigma$.
To achieve this, the transition function of the automaton implementation of $\sigma$ has to be modified into a
mapping $f : Q \times A \times (0, 1] \mapsto Q$, such that $f(q, a, \omega)$ specifies the next state of the automaton, given that
the outcome $a \in A$ was first realized in the current state $q \in Q$ and then $\omega \in (0, 1]$ was drawn.
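The interval rule for the public signal can be sketched as follows. The function names and the list-based representation of the weights $p_k$ and points $w_k$ are assumptions made for this illustration; the paper itself does not prescribe an implementation.

```python
from itertools import accumulate


def select_continuation(omega, probs, continuations):
    """Return w_k such that omega lies in (p_0 + ... + p_{k-1}, p_0 + ... + p_k].

    omega         : realization of the public signal, assumed to lie in (0, 1]
    probs         : weights p_1, ..., p_K of the convex combination
    continuations : payoff profiles w_1, ..., w_K
    """
    if not 0.0 < omega <= 1.0:
        raise ValueError("the public signal must lie in (0, 1]")
    for upper, w_k in zip(accumulate(probs), continuations):
        if omega <= upper:
            return w_k
    # Reached only if rounding makes the weights sum to slightly less than 1.
    return continuations[-1]


def expected_continuation(probs, continuations):
    """Convex combination sum_k p_k * w_k obtained in expectation."""
    n = len(continuations[0])
    return tuple(sum(p * w[i] for p, w in zip(probs, continuations)) for i in range(n))
```

With weights (0.25, 0.75) and points (4, 0) and (0, 4), a signal of 0.25 selects the first point, any larger signal selects the second, and the expected continuation payoff profile is (1, 3).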



4.3.1 The Algorithm 

Algorithm 4 contains the definition of the CubeSupported procedure that convexifies the set of continuation
promises. The definition is given for two players, i.e., $N \equiv \{1, 2\}$. The procedure first identifies $\mathrm{co}\,W$, the
smallest convex set containing all hypercubes of the set $C$ (procedure GetHalfplanes). This convex set
is represented as a set $P$ of half-planes. Each element $p \in P \subset \mathbb{R}^3$ is a vector $p \equiv (\phi^p, \psi^p, \lambda^p)$, such that
the inequality $\phi^p x + \psi^p y \le \lambda^p$ identifies a half-plane in a two-dimensional space. The intersection of these
half-planes gives $\mathrm{co}\,W$. In our experiments, in order to construct the set $P$ from the set $C$, we used the
Graham scan, an efficient technique to identify the boundary points of the convex hull of a set (Graham,
1972).
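A possible shape of the GetHalfplanes step for two players is sketched below. Note that this sketch uses Andrew's monotone chain construction rather than the Graham scan mentioned above, simply because it is short to write down; both return the same hull. The function names and the (origin, side) cube representation are assumptions of this example.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def get_halfplanes(cubes):
    """Return triples (phi, psi, lam) with phi*x + psi*y <= lam describing co W."""
    corners = [(ox + dx, oy + dy)
               for (ox, oy), l in cubes for dx in (0.0, l) for dy in (0.0, l)]
    hull = convex_hull(corners)
    planes = []
    for p, q in zip(hull, hull[1:] + hull[:1]):
        dx, dy = q[0] - p[0], q[1] - p[1]
        # (dy, -dx) is the outward normal of a counter-clockwise edge p -> q.
        planes.append((dy, -dx, dy * p[0] - dx * p[1]))
    return planes


def in_convex_hull(point, planes, tol=1e-9):
    """Check that a point satisfies every half-plane inequality."""
    x, y = point
    return all(phi * x + psi * y <= lam + tol for phi, psi, lam in planes)
```

For two unit squares with origins (0, 0) and (2, 2), the point (1.5, 1.5) lies in no square but does lie in the convex hull, which is exactly the kind of continuation promise that public correlation makes available.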



The procedure CubeSupported defined in Algorithm 4 differs from that of Algorithm 3 in the following
aspects. It does not compute clusters and, consequently, does not iterate. Instead, it convexifies the set $W$
and searches for continuation promises for the hypercube $c$ inside $\mathrm{co}\,W$. The definition of the MIP is
also different. New indicator variables, $z^{a_1,a_2}$, for all pairs $(a_1, a_2) \in A_1 \times A_2$ are introduced. The new
constraint (6), jointly with the modified objective function, verifies that $z^{a_1,a_2}$ is only equal to 1 whenever
both $y_1^{a_1}$ and $y_2^{a_2}$ are equal to 1. In other words, $z^{a_1,a_2} = 1$, only if $a_1 \in A_1^{\alpha_1}$ and $a_2 \in A_2^{\alpha_2}$. Another new
constraint (7) verifies that $(w_1(a_1), w_2(a_2))$, the continuation promise payoff profile, belongs to $\mathrm{co}\,W$ if and
only if $(a_1, a_2) \in A_1^{\alpha_1} \times A_2^{\alpha_2}$. Notice that in the constraint (7), $M$ stands for a sufficiently large number.
In constrained optimization, this is a standard technique for relaxing a given constraint by using binary
indicator variables.



4.4 Computing Strategies 

Algorithm 1 returns the set of hypercubes $C$, such that the union of these hypercubes gives $W$, a set that
contains $U^\gamma$. Intuitively, each hypercube represents all those strategy profiles that induce similar payoff
profiles. Therefore, one can view hypercubes as states of an automaton. Pick a point $v \in W$. Algorithm 5
constructs an automaton $M$ that implements a strategy profile $\sigma$ that approximately induces the payoff
profile $v$.



4.5 Stopping Criterion 

The values of the flags NoCubeWithdrawn and AllCubesCompleted determine whether the basic
algorithm (Algorithm 1) should stop and return the set $W$ approximating the set of SPE payoff profiles






Input: $c \equiv (o^c, l)$, a hypercube; $C$, a set of hypercubes.
1: $P \leftarrow$ GetHalfplanes($C$)
2: Solve the following mixed integer linear optimization problem:
   Decision variables: $w_i(a_i) \in \mathbb{R}$, $w'_i(a_i) \in \mathbb{R}$, $y_i^{a_i} \in \{0, 1\}$, $\alpha_i^{a_i} \in [0, 1]$ for all $i \in \{1, 2\}$ and for all
   $a_i \in A_i$; $z^{a_1,a_2} \in \{0, 1\}$ for all pairs $(a_1, a_2) \in A_1 \times A_2$.
   Objective function: $\min f \equiv \sum_{(a_1,a_2) \in A_1 \times A_2} z^{a_1,a_2}$.
   Subject to constraints:
   For all $i \in \{1, 2\}$:
   (1) $\sum_{a_i} \alpha_i^{a_i} = 1$;
   For all $i \in \{1, 2\}$ and for all $a_i \in A_i$:
   (2) $\alpha_i^{a_i} \le y_i^{a_i}$,
   (3) $w'_i(a_i) = (1-\gamma) \sum_{a_{-i}} \alpha_{-i}^{a_{-i}} r_i(a_i, a_{-i}) + \gamma w_i(a_i)$,
   (4) $o_i^c\, y_i^{a_i} \le w'_i(a_i) \le l\, y_i^{a_i} + o_i^c$,
   (5) $\underline{w}_i - \underline{w}_i y_i^{a_i} \le w_i(a_i) \le (\underline{w}_i + l) - (\underline{w}_i + l)\, y_i^{a_i} + \bar{r}\, y_i^{a_i}$;
   For all $a_1 \in A_1$ and for all $a_2 \in A_2$:
   (6) $y_1^{a_1} + y_2^{a_2} \le z^{a_1,a_2} + 1$;
   For all $p \equiv (\phi^p, \psi^p, \lambda^p) \in P$ and for all pairs $(a_1, a_2) \in A_1 \times A_2$:
   (7) $\phi^p w_1(a_1) + \psi^p w_2(a_2) \le \lambda^p z^{a_1,a_2} + M - M z^{a_1,a_2}$.
3: if a solution is found then
4:     return $w_i(a_i)$ and $\alpha_i^{a_i}$ for all $i \in \{1, 2\}$ and for all $a_i \in A_i$.
5: return False

Algorithm 4: CubeSupported for mixed actions and public correlation. The procedure verifies whether a
given hypercube $c$ has to be kept in the set of hypercubes $C$. If $c$ has to be kept in $C$, CubeSupported
returns a mixed action profile and the corresponding continuation promise payoffs for each pure action in
the support of mixed actions. Otherwise the procedure returns False.



(and entirely containing it). At the end of each iteration of the algorithm, the flag AllCubesCompleted is
only True, if for none of the remaining hypercubes $c \in C$, CubeCompleted($c$) is False. The procedure
CubeCompleted, in turn, verifies, for hypercube $c$, that the two conditions of the problem stated in
Subsection 2.5 are satisfied, namely:



1. For any $v \in W$, the strategy profile $\sigma$, implemented by the automaton $M$ constructed by Algorithm 5,
induces the payoff profile $u^\gamma(\sigma)$, such that, for all $i$, $v_i - u_i^\gamma(\sigma) \le \epsilon$,

and 

2. The maximum payoff gi that each player i can achieve by unilaterally deviating from a is such that 

Both conditions can be verified by dynamic programming. For example, the second condition can be verified
by using the value iteration algorithm (Sutton and Barto, 1998). To do this, the deviating agent $i$ has to be
considered as the only decision maker (optimizer). The remaining agents' strategy profile $\sigma_{-i}$ can then be
viewed as the decision maker's environment.
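To make the dynamic programming check concrete, the following minimal sketch runs value iteration for a single deviating player against a fixed automaton. It assumes that the automaton and the other players' behavior have already been compiled into a table `model[q][a] = (reward, next_state)` giving, for each state and each action of the deviator, the immediate expected payoff and the next state; this table, the function name, and the deterministic transitions are simplifications chosen for illustration, not part of the paper.

```python
def best_deviation_values(model, gamma, tol=1e-10, max_iter=100000):
    """Value iteration for the deviating player treated as the only optimizer.

    Returns the normalized discounted payoff the player can secure from each
    automaton state: V(q) = max_a [(1 - gamma) * r(q, a) + gamma * V(next(q, a))].
    """
    values = {q: 0.0 for q in model}
    for _ in range(max_iter):
        new_values = {
            q: max((1 - gamma) * reward + gamma * values[nxt]
                   for reward, nxt in model[q].values())
            for q in model
        }
        if max(abs(new_values[q] - values[q]) for q in model) < tol:
            return new_values
        values = new_values
    return values
```

As a usage example, consider a grim-trigger automaton in a Prisoner's Dilemma with stage payoffs 2 (mutual cooperation), 3 (unilateral defection), 0 (mutual defection) and -1 (being defected upon), and a discount factor of 0.8. The best the deviator can obtain from the cooperative state is 2, the equilibrium payoff itself, so no profitable deviation exists in this example.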



5 Theoretical Analysis 

In this section, we examine the theoretical properties of Algorithm 1 for the case of mixed strategies. While
the procedure CubeSupported for pure strategies (Algorithm 2) is defined differently, mixed strategies
include pure ones. Therefore, in our theoretical analysis, we concentrate on two more general cases: mixed
strategies with no external coordination (Algorithm 1 with CubeSupported given by Algorithm 3) and
mixed strategies with public correlation (Algorithm 1 with CubeSupported given by Algorithm 4).
Input: $C$, a set of hypercubes, such that $W$ is their union; $v \in W$, a payoff profile.
 1: Find a hypercube $c \in C$, which $v$ belongs to; set $Q \leftarrow \{c\}$ and $q^0 \leftarrow c$;
 2: for each player $i$ do
 3:     Find $\underline{w}^i = \arg\min_{w \in W} w_i$ and a hypercube $c^i \in C$, which $\underline{w}^i$ belongs to;
 4:     Set $Q \leftarrow Q \cup \{c^i\}$;
 5: Set $f \leftarrow Q \mapsto \times_i \Delta(A_i)$;
 6: Set $\tau \leftarrow Q \times A \mapsto C$.
 7: loop
 8:     if $f(q)$ is defined for every $q \in Q$ then
 9:         return $M \equiv (Q, q^0, f, \tau)$.
10:     Pick a hypercube $q \in Q$, for which $f(q)$ is not defined.
11:     Apply the procedure CubeSupported($q$) and obtain a (mixed) action profile $\alpha$ and continuation
        payoff profiles $w(a)$ for all $a \in \times_i A_i^{\alpha_i}$;
12:     Define $f(q) \equiv \alpha$.
13:     for each $a \in \times_i A_i^{\alpha_i}$ do
14:         Find a hypercube $c \in C$, which $w(a)$ belongs to, set $Q \leftarrow Q \cup \{c\}$;
15:         Define $\tau(q, a) \equiv c$.
16:     for each $i$ and each $a^i \in (A_i \setminus A_i^{\alpha_i}) \times_{j \in N \setminus \{i\}} A_j^{\alpha_j}$ do
17:         Define $\tau(q, a^i) \equiv c^i$.

Algorithm 5: Algorithm for constructing an automaton $M$ that approximately induces the given payoff
profile $v$.
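The construction of Algorithm 5 can be sketched as a worklist loop. In the sketch below, `cube_supported` and `locate` are hypothetical callables standing in for the CubeSupported procedure and for the lookup of the hypercube containing a payoff profile; states are arbitrary hashable labels; and a mixed action of player i is represented as a dictionary from pure actions to probabilities. None of these representation choices comes from the paper.

```python
from itertools import product


def build_automaton(start, punish, action_sets, cube_supported, locate):
    """Illustrative automaton construction following the structure of Algorithm 5.

    start          : hypercube containing the target payoff profile v
    punish         : punish[i] is the punishment hypercube of player i
    action_sets    : action_sets[i] lists the pure actions of player i
    cube_supported : q -> (alpha, continuation), where continuation maps each
                     in-support pure action profile to a payoff profile
    locate         : payoff profile -> hypercube containing it
    """
    f, tau = {}, {}
    pending = list(dict.fromkeys([start, *punish]))  # states with f undefined
    states = set(pending)
    while pending:
        q = pending.pop()
        alpha, continuation = cube_supported(q)
        f[q] = alpha
        support = [[a for a in acts if alpha[i].get(a, 0.0) > 0.0]
                   for i, acts in enumerate(action_sets)]
        # In-support outcomes lead to the hypercube of the promised continuation.
        for a in product(*support):
            c = locate(continuation[a])
            tau[(q, a)] = c
            if c not in states:
                states.add(c)
                pending.append(c)
        # A unilateral out-of-the-support deviation of player i leads to punish[i].
        for i, acts in enumerate(action_sets):
            outside = [a for a in acts if a not in support[i]]
            for a in product(*(support[:i] + [outside] + support[i + 1:])):
                tau[(q, a)] = punish[i]
    return states, start, f, tau
```

The loop terminates because every hypercube is processed at most once, which parallels the finiteness argument used for the construction; it does assume that `cube_supported` succeeds for every state it is called on.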



Theorem 2. For any repeated game, discount factor $\gamma$ and approximation factor $\epsilon$, (1) Algorithm 1 terminates
in finite time, (2) $C$ contains at least one hypercube, and (3) for all $c \in C$, Algorithm 5 terminates in
finite time and returns a finite automaton $M$ that satisfies:

1. The strategy profile $\sigma$ implemented by $M$ induces the payoff profile $v \equiv u^\gamma(\sigma)$, such that, for all $i$,
$o_i^c - v_i \le \epsilon$,

and

2. The maximum payoff $g_i$ that each player $i$ can achieve by unilaterally deviating from $\sigma$ is such that
$g_i - v_i \le \epsilon$.

The proof of Theorem 2 relies on the following lemmas.

Lemma 1. At any point of execution of Algorithm 1, $C$ contains at least one hypercube.



Proof. According to Nash (1950a), any stage-game has at least one equilibrium. Let $v$ be a payoff profile
of a certain Nash equilibrium in the stage-game. For the hypercube $c$ that contains $v$, the procedure
CubeSupported will always return True, because for any $\gamma$, $v$ satisfies the two conditions of Equation (4),
with $w' = w = v$ and $\alpha$ being a mixed action profile that induces $v$. Therefore, $c$ will never be withdrawn. □

Lemma 2. An iteration of Algorithm 1, such that NoCubeWithdrawn is True, will be reached in finite
time.

Proof. Because the number of hypercubes (and, therefore, the number of clusters) is finite, the procedure
CubeSupported given by Algorithm 3 will terminate in finite time. The same is true for CubeSupported
given by Algorithm 4. For a constant $l$, the set $C$ is finite and contains at most $\lceil (\bar{r} - \underline{r})/l \rceil$ elements.
Therefore, and according to Lemma 1, after a finite time, there will be an iteration of Algorithm 1, such
that for all $c \in C$, CubeSupported($c$) returns True. □

Lemma 3. Let $C$ be the set of hypercubes at the end of a certain iteration of Algorithm 1, such that
NoCubeWithdrawn is True. For all $c \in C$, Algorithm 5 terminates in finite time and returns a complete
finite automaton.






Proof. By observing the definition of Algorithm 5, the proof follows from the fact that the number of
hypercubes and, therefore, the possible number of the automaton states is finite. Furthermore, the definition
of the automaton will be complete, because the fact that NoCubeWithdrawn is True implies that for
each hypercube $c \in C$, there is a mixed action $\alpha$ and a continuation payoff profile $w$ belonging to a certain
hypercube $c' \in C$. Consequently, for each state $q$ of the automaton, the functions $f(q)$ and $\tau(q)$ will be
defined. □

Lemma 4. Let $C$ be the set of hypercubes at the end of a certain iteration of Algorithm 1, such that
NoCubeWithdrawn is True. Let $l$ be the current value of the hypercube side length. For every $c \in C$, the
strategy profile $\sigma$, implemented by the automaton $M$ that starts in $c$, induces the payoff profile $v \equiv u^\gamma(M)$,
such that, for all $i$, $o_i^c - v_i \le \frac{\gamma l}{1-\gamma}$.

Proof. When player $i$ is following the strategy prescribed by the automaton constructed by Algorithm 5, this
process can be reflected by an equilibrium graph, as the one shown in Figure 6.

Figure 6: Equilibrium graph for player $i$. The graph represents the initial state followed by a non-cyclic
sequence of states (nodes 1 to $Z$) followed by a cycle of $X$ states (nodes $Z+1$ to $Z+X$). The labels over
the nodes are the immediate expected payoffs collected by player $i$ in the corresponding states.

Because for all hypercubes $c$ behind the states of the automaton, CubeSupported returns True, we have:

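$$
\begin{aligned}
(1.1)\quad & o_i^1 \le (1-\gamma)\, r_i^1 + \gamma w_i^1 \le o_i^1 + l, \\
(1.2)\quad & o_i^2 \le w_i^1 \le o_i^2 + l, \\
(2.1)\quad & o_i^2 \le (1-\gamma)\, r_i^2 + \gamma w_i^2 \le o_i^2 + l, \\
(2.2)\quad & o_i^3 \le w_i^2 \le o_i^3 + l, \\
& \qquad\vdots \\
(Z.1)\quad & o_i^Z \le (1-\gamma)\, r_i^Z + \gamma w_i^Z \le o_i^Z + l, \\
(Z.2)\quad & o_i^{Z+1} \le w_i^Z \le o_i^{Z+1} + l, \\
(Z{+}1.1)\quad & o_i^{Z+1} \le (1-\gamma)\, r_i^{Z+1} + \gamma w_i^{Z+1} \le o_i^{Z+1} + l, \\
(Z{+}1.2)\quad & o_i^{Z+2} \le w_i^{Z+1} \le o_i^{Z+2} + l, \\
& \qquad\vdots \\
(Z{+}X.1)\quad & o_i^{Z+X} \le (1-\gamma)\, r_i^{Z+X} + \gamma w_i^{Z+X} \le o_i^{Z+X} + l, \\
(Z{+}X.2)\quad & o_i^{Z+1} \le w_i^{Z+X} \le o_i^{Z+1} + l,
\end{aligned}
\tag{5}
$$

where $o_i^q$, $r_i^q$ and $w_i^q$ stand respectively for (i) the payoff of player $i$ in the origin of the hypercube behind
the state $q$, (ii) the immediate expected payoff of player $i$ for playing according to $f_i(q)$ or for deviating
inside the support of $f_i(q)$, and (iii) the continuation promise payoff of player $i$ for playing according to the
equilibrium strategy profile in state $q$.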
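The following development only uses the inequalities of Equation (5), one by one. It starts with
inequality (Z+1.1):

$$
\begin{aligned}
o_i^{Z+1} &\le (1-\gamma)\, r_i^{Z+1} + \gamma w_i^{Z+1} \\
&\le (1-\gamma)\, r_i^{Z+1} + \gamma (o_i^{Z+2} + l) && \text{(by inequality (Z+1.2))} \\
&\le (1-\gamma)\, r_i^{Z+1} + \gamma \left((1-\gamma)\, r_i^{Z+2} + \gamma w_i^{Z+2}\right) + \gamma l && \text{(by inequality (Z+2.1))} \\
&\le (1-\gamma)\, r_i^{Z+1} + \gamma (1-\gamma)\, r_i^{Z+2} + \gamma^2 (o_i^{Z+3} + l) + \gamma l && \text{(by inequality (Z+2.2))} \\
&\;\;\vdots
\end{aligned}
\tag{6}
$$

$$
o_i^{Z+1} \le (1-\gamma) \sum_{x=1}^{X} \gamma^{x-1} r_i^{Z+x} + \gamma^X o_i^{Z+1} + \gamma \sum_{x=1}^{X} \gamma^{x-1} l \qquad \text{(by inequality (Z+X.2))}
\tag{7}
$$

Denote by $g_i^A$ the long-term expected non-normalized payoff for player $i$ for passing through the cycle A of
the equilibrium graph infinitely often:

$$
g_i^A = \sum_{x=1}^{X} \gamma^{x-1} r_i^{Z+x} + \gamma^X g_i^A, \quad \text{i.e.,} \quad g_i^A = \frac{\sum_{x=1}^{X} \gamma^{x-1} r_i^{Z+x}}{1-\gamma^X}.
\tag{8}
$$

The property of the sum of the geometric series allows us to write:

$$
\sum_{x=1}^{X} \gamma^{x-1} l = l\, \frac{1-\gamma^X}{1-\gamma}.
\tag{9}
$$

From Equations (6)-(10) it follows that,

$$
o_i^{Z+1} \le (1-\gamma)\, g_i^A + \frac{\gamma l}{1-\gamma}.
\tag{11}
$$

Using inequalities (1.1)-(Z.2) of Equation (5), the following development is possible:

$$
\begin{aligned}
o_i^1 &\le (1-\gamma)\, r_i^1 + \gamma w_i^1 && \text{(by inequality (1.1))} \\
&\le (1-\gamma)\, r_i^1 + \gamma (o_i^2 + l) && \text{(by inequality (1.2))} \\
&\le (1-\gamma)\, r_i^1 + \gamma \left((1-\gamma)\, r_i^2 + \gamma w_i^2\right) + \gamma l && \text{(by inequality (2.1))} \\
&\le (1-\gamma)\, r_i^1 + \gamma (1-\gamma)\, r_i^2 + \gamma^2 (o_i^3 + l) + \gamma l && \text{(by inequality (2.2))} \\
&\;\;\vdots
\end{aligned}
\tag{12}
$$

$$
o_i^1 \le (1-\gamma) \sum_{z=1}^{Z} \gamma^{z-1} r_i^z + \gamma^Z o_i^{Z+1} + \gamma \sum_{z=1}^{Z} \gamma^{z-1} l \qquad \text{(by inequality (Z.2))}
\tag{13}
$$

From Equations (11) and (12) it follows that,

$$
o_i^1 \le (1-\gamma) \sum_{z=1}^{Z} \gamma^{z-1} r_i^z + \gamma^Z (1-\gamma)\, g_i^A + \frac{\gamma l}{1-\gamma}.
\tag{14}
$$

Denote by $g_i^B$ the long-term expected (normalized) payoff for player $i$ for passing through the equilibrium
graph (graph B in Figure 6) infinitely often. Observe that,

$$
g_i^B = (1-\gamma) \sum_{z=1}^{Z} \gamma^{z-1} r_i^z + \gamma^Z (1-\gamma)\, g_i^A.
$$

Therefore,

$$
o_i^1 - g_i^B \le \frac{\gamma l}{1-\gamma}.
$$

□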

□ 



Lemma 5. Let $C$ be the set of hypercubes at the end of a certain iteration of Algorithm 1, such that
NoCubeWithdrawn is True. Let $l$ be the current value of the hypercube side length. For every $c \in C$,
the maximum gain $g_i$ that each player $i$ can achieve by unilaterally deviating from the strategy profile $\sigma$
implemented by an automaton $M$ that starts in $c$ and induces the payoff profile $v \equiv u^\gamma(M)$ is such that
$g_i - v_i \le \frac{2l}{1-\gamma}$.

Proof. To prove the lemma, one has to bound the maximum gain of a deviation that starts in an arbitrary
state of an automaton. Consider two deviation graphs for player $i$ depicted in Figure 7.

Figure 7: Deviation graphs for player $i$. (a) A generic deviation graph for player $i$. The graph represents
the initial deviation state (node 0) followed by a transition into the punishment state (node 1) followed by
a number of in-equilibrium (or, otherwise, inside-the-support deviation) states (nodes 1 to $L-1$) followed
by the subsequent out-of-the-support deviation state (node $L$). (b) A particular, one-state deviation graph,
where the only deviation state is the punishment state for player $i$. The labels over the nodes are the
immediate expected payoffs collected by player $i$ in the corresponding states.

A deviation graph for player $i$ is a finite graph, which reflects the optimal behavior for player $i$ assuming that the behavior of
the other players is fixed and is given by an automaton returned by Algorithm 5. The nodes of the deviation
graph correspond to the states of the automaton. The labels over the nodes are the immediate expected
payoffs collected by player $i$ in the corresponding states. A generic deviation graph for player $i$ (Figure 7a)
is a deviation graph that has one cyclic and one non-cyclic part. In the cyclic part (subgraph A), player $i$
follows the equilibrium strategy or deviations take place inside the support of the prescribed mixed actions
(nodes 1 to $L-1$, with node 1 corresponding to the punishment state$^2$ for player $i$). In the last node of the
cyclic part (node $L$), an out-of-the-support deviation takes place. The non-cyclic part of the generic deviation
graph contains a single node corresponding to the state, where the initial out-of-the-support deviation of
player $i$ from the SPE strategy profile occurs. If the state of the initial deviation of player $i$ is itself the
punishment state for player $i$, then the deviation graph will look as shown in Figure 7b.

The present proof only considers the generic deviation graph (Figure 7a); the proof for the particular
cases, like that of Figure 7b, can be obtained by analogy and, because it leads to the same result, we omit
it here. Consider first the subgraph A of the generic deviation graph. Because, for all hypercubes $c$ behind
the states of the automaton, CubeSupported returns True, we have:

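$$
\begin{aligned}
(1.0)\quad & o_i^1 \le \underline{w}_i \le o_i^1 + l, \\
(1.1)\quad & o_i^1 \le (1-\gamma)\, r_i^1 + \gamma w_i^1 \le o_i^1 + l, \\
(1.2)\quad & o_i^2 \le w_i^1 \le o_i^2 + l, \\
(2.1)\quad & o_i^2 \le (1-\gamma)\, r_i^2 + \gamma w_i^2 \le o_i^2 + l, \\
(2.2)\quad & o_i^3 \le w_i^2 \le o_i^3 + l, \\
& \qquad\vdots \\
(Z.1)\quad & o_i^Z \le (1-\gamma)\, r_i^Z + \gamma w_i^Z \le o_i^Z + l, \\
(Z.2)\quad & (1-\gamma)\, r_i^Z + \gamma w_i^Z - (1-\gamma)\, BR_i^Z - \gamma \underline{w}_i \ge 0,
\end{aligned}
\tag{15}
$$

where $o_i^q$, $r_i^q$ and $w_i^q$ stand respectively for (i) the payoff of player $i$ in the origin of the hypercube behind
the state $q$, (ii) the immediate expected payoff of player $i$ for playing according to $f_i(q)$ or for deviating
inside the support of $f_i(q)$, and (iii) the continuation promise payoff of player $i$ for playing according to the
equilibrium strategy profile in state $q$.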



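The following development only uses the inequalities of Equation (15), one by one. It starts with
inequality (1.1):

$$
\begin{aligned}
o_i^1 &\ge (1-\gamma)\, r_i^1 + \gamma w_i^1 - l \\
&\ge (1-\gamma)\, r_i^1 + \gamma o_i^2 - l && \text{(by inequality (1.2))} \\
&\ge (1-\gamma)\, r_i^1 + \gamma \left((1-\gamma)\, r_i^2 + \gamma w_i^2 - l\right) - l && \text{(by inequality (2.1))} \\
&\ge (1-\gamma)\, r_i^1 + \gamma (1-\gamma)\, r_i^2 + \gamma^2 o_i^3 - \gamma l - l && \text{(by inequality (2.2))} \\
&\ge (1-\gamma) \sum_{z=1}^{Z-1} \gamma^{z-1} r_i^z + \gamma^{Z-1} o_i^Z - \sum_{z=1}^{Z-1} \gamma^{z-1} l && \text{(by inequalities (3.1) to (Z-1.1))} \\
&\ge (1-\gamma) \sum_{z=1}^{Z-1} \gamma^{z-1} r_i^z + \gamma^{Z-1} \left((1-\gamma)\, r_i^Z + \gamma w_i^Z - l\right) - \sum_{z=1}^{Z-1} \gamma^{z-1} l && \text{(by inequality (Z.1))} \\
&\ge (1-\gamma) \sum_{z=1}^{Z-1} \gamma^{z-1} r_i^z + \gamma^{Z-1} \left((1-\gamma)\, BR_i^Z + \gamma \underline{w}_i - l\right) - \sum_{z=1}^{Z-1} \gamma^{z-1} l && \text{(by inequality (Z.2))} \\
&\ge (1-\gamma) \sum_{z=1}^{Z-1} \gamma^{z-1} r_i^z + \gamma^{Z-1} (1-\gamma)\, BR_i^Z + \gamma^Z o_i^1 - \sum_{z=1}^{Z} \gamma^{z-1} l && \text{(by inequality (1.0))}
\end{aligned}
\tag{16}
$$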



$^2$ The punishment state for player $i$ is the automaton state, which is based on the hypercube that contains a payoff profile $v$,
such that $v_i = \underline{w}_i$.
Denote by $g_i^A$ the long-term expected non-normalized payoff of player i for passing through the cycle A of 
the generic deviation graph infinitely often: 

$$
g_i^A = \sum_{z=1}^{Z-1}\gamma^{z-1} r_i^z + \gamma^{Z-1} BR_i^Z + \gamma^{Z} g_i^A. \tag{17}
$$
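The recursive definition of the cycle payoff in Equation (17) can be checked numerically. The sketch below is only an illustration, not part of the original algorithm: the function names, the sample rewards and the value of the discount factor are all hypothetical. It solves the recursion in closed form and compares the result with a brute-force discounted sum over many passes through the cycle.

```python
def cycle_payoff_closed_form(rewards, br_last, gamma):
    """Solve g = sum_{z=1}^{Z-1} gamma^(z-1) r_z + gamma^(Z-1) BR_Z + gamma^Z g for g.

    rewards holds r_1, ..., r_{Z-1}; br_last is the deviation payoff BR_Z.
    """
    cycle_length = len(rewards) + 1  # Z
    one_pass = sum(gamma ** k * r for k, r in enumerate(rewards))
    one_pass += gamma ** (cycle_length - 1) * br_last
    return one_pass / (1.0 - gamma ** cycle_length)


def cycle_payoff_by_simulation(rewards, br_last, gamma, passes=2000):
    """Accumulate the discounted payoff stream over many repetitions of the cycle."""
    stream = list(rewards) + [br_last]
    total, discount = 0.0, 1.0
    for _ in range(passes):
        for payoff in stream:
            total += discount * payoff
            discount *= gamma
    return total


# Hypothetical numbers: a cycle of length Z = 3 with discount factor 0.7.
closed = cycle_payoff_closed_form([1.0, 2.0], 3.0, 0.7)
simulated = cycle_payoff_by_simulation([1.0, 2.0], 3.0, 0.7)
```

Both values agree up to floating-point error, which is the geometric-series argument that the proof relies on when it passes from the recursion to a closed form.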



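The property of the infinite sum of the geometric series permits us to write: 

$$
g_i^A = \frac{1}{1-\gamma^{Z}}\left(\sum_{z=1}^{Z-1}\gamma^{z-1} r_i^z + \gamma^{Z-1} BR_i^Z\right), \tag{18}
$$

$$
\sum_{z=1}^{Z}\gamma^{z-1} l = \frac{1-\gamma^{Z}}{1-\gamma}\, l. \tag{19}
$$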



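From Equations (16) to (19) it follows that 

$$
o_i^1 \ge (1-\gamma)\, g_i^A - \frac{l}{1-\gamma}. \tag{20}
$$

Furthermore, we have, 

$$
\begin{aligned}
(1.1)\quad & (1-\gamma) r_i^0 + \gamma w_i^0 - (1-\gamma) BR_i^0 - \gamma \underline{w}_i \ge 0, \\
(1.2)\quad & o_i^0 \le (1-\gamma) r_i^0 + \gamma w_i^0 \le o_i^0 + l.
\end{aligned}
\tag{21}
$$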

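The following development is possible: 

$$
\begin{aligned}
&\text{(by inequality (1.1) of Equation (21))} \\
(1-\gamma) r_i^0 + \gamma w_i^0 &\ge (1-\gamma) BR_i^0 + \gamma \underline{w}_i \\
&\text{(by inequality (1.2) of Equation (21))} \\
o_i^0 &\ge (1-\gamma) BR_i^0 + \gamma \underline{w}_i - l \\
&\text{(by inequality (1.0) of Equation (15))} \\
&\ge (1-\gamma) BR_i^0 + \gamma o_i^1 - l.
\end{aligned}
\tag{22}
$$

From Equations (20) and (22) it follows that 

$$
o_i^0 \ge (1-\gamma) BR_i^0 + \gamma\left((1-\gamma)\, g_i^A - \frac{l}{1-\gamma}\right) - l. \tag{23}
$$

Denote by $g_i^B$ the long-term expected (normalized) payoff of player i for following the generic deviation 
graph (graph B in Figure 7). Observe that, 

$$
g_i^B = (1-\gamma) BR_i^0 + \gamma (1-\gamma)\, g_i^A.
$$

Therefore, 

$$
g_i^B - o_i^0 \le \frac{\gamma l}{1-\gamma} + l = \frac{l}{1-\gamma}.
$$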

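Finally, by Lemma 4, starting from the state that corresponds to the node of the generic deviation graph, 
the payoff profile v, induced by the automaton, satisfies $o_i^0 \le \frac{l}{1-\gamma} + v_i$. Therefore, 

$$
g_i^B - v_i = \left(g_i^B - o_i^0\right) + \left(o_i^0 - v_i\right) \le \frac{l}{1-\gamma} + \frac{l}{1-\gamma} = \frac{2l}{1-\gamma}.
$$

□ 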



Lemma 6. Algorithm 1 terminates in finite time. 

Proof. The hypercube side length $l$ is halved every time no hypercube is withdrawn by the end of an 
iteration of the algorithm. Therefore, and by Lemma 2, any given value of $l$ will be reached after a 
finite time. By Lemmas 4 and 5, Algorithm 1, in the worst case, terminates whenever $l$ becomes lower than 
or equal to $\epsilon(1-\gamma)/2$. □ 
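The termination argument can be illustrated with a few lines of arithmetic. The sketch below assumes that the worst-case stopping threshold is $\epsilon(1-\gamma)/2$, which matches the $2l/(1-\gamma)$ deviation bound of the preceding proof; the function name and the starting side length are hypothetical and are not taken from Algorithm 1.

```python
def halvings_until_termination(initial_side, epsilon, gamma):
    """Count how often the side length must be halved to reach the worst-case threshold.

    Assumption: the threshold is epsilon * (1 - gamma) / 2, so that the
    deviation bound 2 * l / (1 - gamma) does not exceed epsilon.
    """
    threshold = epsilon * (1.0 - gamma) / 2.0
    side, halvings = initial_side, 0
    while side > threshold:
        side /= 2.0
        halvings += 1
    return halvings, side


# Hypothetical setting: initial side length 1, epsilon = 0.1, gamma = 0.5.
halvings, final_side = halvings_until_termination(1.0, 0.1, 0.5)
deviation_bound = 2.0 * final_side / (1.0 - 0.5)
```

With these numbers the threshold is 0.025, so six halvings are needed and the resulting bound on the gain from deviating is 0.0625, below the requested 0.1. The number of halvings grows only logarithmically in $1/\epsilon$, although each halving multiplies the number of hypercubes.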
Figure 8: Four game matrices: (a) Duopoly game, (b) Rock, Paper, Scissors, (c) Battle of the Sexes and 
(d) Game with no pure action Nash equilibrium in stage-game. 



6 Experimental Results 

In this section, we present several significant experimental results for a number of well-known games. These 
are Prisoner's Dilemma (Figure 1), Duopoly (Figure 8a), Rock, Paper, Scissors (Figure 8b), Battle of the 
Sexes (Figure 8c), and a game with no stage-game pure action equilibrium (Figure 8d). For these games, 
certain equilibrium properties are known or can be readily analytically verified. 

The graphs in Figure 9 reflect, for three different values of the discount factor, the evolution of the set of 
SPE payoff profiles computed by Algorithm 1 for the case of mixed strategies with public correlation in the 
repeated Prisoner's Dilemma. Here and below, the vertical and the horizontal axes of each graph correspond 
respectively to the payoffs of the first and the second players. The upper and lower limits of each axis are 
given respectively by $\bar{r}$ and $\underline{r}$. The numbers under the graphs reflect the algorithm's 
iterations. The red (darker) regions on a graph reflect the hypercubes that remain in the set C by the end of 
the corresponding iteration. One can see in Figure 9a that when $\gamma$ is sufficiently large, the algorithm 
maintains a set that converges towards the set F* of feasible and individually rational payoff profiles, the 
largest possible set of SPE payoff profiles. On the other hand, in Figure 9, one can see that when $\gamma$ is 
close enough to 0, the set of SPE payoff profiles converges, as expected, towards the point (0, 0) that 
corresponds to the Nash equilibrium of the stage-game: a strategy profile that prescribes playing D at every 
repeated game period. 

Rock, Paper, Scissors (RPC) is a symmetrical zero-sum game. In the repeated RPC game, the point (0, 0) 
is the only possible SPE payoff profile, regardless of the discount factor. This payoff profile can be realized 
by a stationary strategy profile prescribing to each player to sample actions from a uniform distribution. 
The graphs in Figure 10 certify the correctness of Algorithm 1 in this case. 
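The claim about the uniform stationary profile can be verified directly on the stage game. The following sketch assumes the standard Rock, Paper, Scissors payoffs (1 for a win, -1 for a loss, 0 for a tie), since the matrix of Figure 8b is not reproduced here; all identifiers are hypothetical. It uses exact fractions to compute the row player's expected payoff under uniform mixing and the best payoff the row player could obtain by deviating to a pure action.

```python
from fractions import Fraction

ACTIONS = ("R", "P", "S")

# Assumed row-player payoffs (win = 1, loss = -1, tie = 0); the column
# player's payoff is the negative of these because the game is zero-sum.
ROW_PAYOFF = {
    ("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
    ("P", "R"): 1, ("P", "P"): 0, ("P", "S"): -1,
    ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0,
}


def expected_row_payoff(row_mix, col_mix):
    """Expected stage-game payoff of the row player under two mixed actions."""
    return sum(row_mix[a] * col_mix[b] * ROW_PAYOFF[(a, b)]
               for a in ACTIONS for b in ACTIONS)


def pure(action):
    """Degenerate mixed action that plays `action` with probability one."""
    return {a: Fraction(int(a == action)) for a in ACTIONS}


uniform = {a: Fraction(1, 3) for a in ACTIONS}
equilibrium_value = expected_row_payoff(uniform, uniform)
best_deviation_value = max(expected_row_payoff(pure(a), uniform) for a in ACTIONS)
```

Under these assumed payoffs both quantities equal 0, so no pure deviation improves on uniform mixing, and by symmetry the same holds for the column player.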

Battle of the Sexes (BoS) is the game that has two pure action stage-game equilibria, (O, O) and (F, F), 
with payoff profiles respectively (1, 2) and (2, 1). The game also has one mixed action stage-game equilibrium 
with payoff profile (2/3, 2/3). When $\gamma$ is sufficiently close to 0, the set of SPE payoff profiles 
computed by Algorithm 1 converges towards these three points (Figure 11b), which is the expected behavior. 
As $\gamma$ grows, the set of SPE payoff profiles becomes larger (Figure 11a). We also ascertained that when 
the value of $\gamma$ becomes sufficiently close to 1, the set of SPE payoff profiles converges towards F* and 
eventually includes the point (3/2, 3/2). The latter point is interesting in that it maximizes the Nash 
product (Nash, 1950b). 
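The three stage-game equilibrium payoffs and the point (3/2, 3/2) can be reproduced with exact arithmetic. The sketch below assumes that miscoordination, (O, F) or (F, O), yields (0, 0), which is consistent with the mixed equilibrium payoff (2/3, 2/3) quoted above but is not read off Figure 8c. The Nash product is taken relative to the origin, which is also an assumption; any symmetric disagreement point gives the same maximizer on this segment. All identifiers are hypothetical.

```python
from fractions import Fraction

# Assumed Battle of the Sexes payoffs: (player 1, player 2) per action profile.
PAYOFFS = {("O", "O"): (1, 2), ("O", "F"): (0, 0),
           ("F", "O"): (0, 0), ("F", "F"): (2, 1)}


def expected_payoffs(p, q):
    """Expected payoffs when player 1 plays O with probability p and player 2 with q."""
    probs = {("O", "O"): p * q, ("O", "F"): p * (1 - q),
             ("F", "O"): (1 - p) * q, ("F", "F"): (1 - p) * (1 - q)}
    v1 = sum(probs[a] * PAYOFFS[a][0] for a in PAYOFFS)
    v2 = sum(probs[a] * PAYOFFS[a][1] for a in PAYOFFS)
    return v1, v2


# Indifference conditions: q = 2(1 - q) for player 1 and 2p = 1 - p for player 2.
p, q = Fraction(1, 3), Fraction(2, 3)
mixed_equilibrium_payoffs = expected_payoffs(p, q)

# Points t*(1, 2) + (1 - t)*(2, 1) = (2 - t, 1 + t) on the efficient segment;
# pick the one with the largest product of payoffs on a grid of t values.
best_t = max((Fraction(k, 100) for k in range(101)),
             key=lambda t: (2 - t) * (1 + t))
nash_product_point = (2 - best_t, 1 + best_t)
```

The mixed equilibrium payoff evaluates to (2/3, 2/3), and the product of payoffs along the segment between (1, 2) and (2, 1) is maximized at its midpoint (3/2, 3/2).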

It was particularly interesting for us to see whether, when applied to the repeated Duopoly game 
(Figure 8a), Algorithm 1 for pure strategies preserves the point (10, 10) in the set of SPE payoff profiles. 
Abreu (1988) showed that this point can be part of the set of SPE payoff profiles only if $\gamma$ > 4/7. In 
our experiments, we observed that for $\epsilon$ = 0.01, the point (10, 10) indeed remains in the set of SPE 
payoff profiles when 4/7 < $\gamma$ < 1. Moreover, the payoff profile (0, 0) of the optimal penal code, which 
was proposed by Abreu (1988) as the profile of punishment strategies, also remains there (Figures 12a and 
12b). Algorithm 1 also returns an automaton that induces a strategy profile that generates the payoff profile 
(10, 10). Interestingly, this automaton induces a strategy profile that is equivalent to the optimal penal 
code based strategy profile proposed by Abreu (1988). To the best of our knowledge, this is the first time that 
optimal penal code based strategies, which so far were only proven to exist (in the general case), have been 
algorithmically computed. 

Another experiment was conducted with the game that does not possess any pure action stationary 
equilibrium (Figure 8d). In such games, for lower discount factors, the algorithm of Judd et al. (2003), that 
Figure 9: The evolution of the set of SPE payoff profiles computed by Algorithm 1 for mixed strategies 
with public correlation in the repeated Prisoner's Dilemma. Panels: (a) $\gamma$ = 0.7, $\epsilon$ = 0.01; 
(b) $\gamma$ = 0.3, $\epsilon$ = 0.01. The numbers under the graphs reflect the algorithm's iterations. The 
red (darker) regions denote the hypercubes that remain in the set C by the end of the corresponding 
iteration. 




Figure 10: The evolution of the set of SPE payoff profiles computed by Algorithm 1 for mixed actions with 
public correlation in the repeated Rock, Paper, Scissors with $\gamma$ = 0.7 and $\epsilon$ = 0.01. 
Figure 11: The evolution of the set of SPE payoff profiles computed by Algorithm 1 for mixed actions without 
public correlation in the repeated Battle of the Sexes. Panels: (a) $\gamma$ = 0.45, $\epsilon$ = 0.01; 
(b) $\gamma$ = 0.05, $\epsilon$ = 0.01. 




Figure 12: SPE payoff profiles in repeated Duopoly game computed by Algorithm 1 for pure strategies with 
$\gamma$ = 0.6 and $\epsilon$ = 0.01. (a) The evolution of the set of SPE payoff profiles through different 
iterations of the algorithm. (b) Abreu's optimal penal code solution is contained within the set of SPE 
payoff profiles. 

Figure 13: The sets of SPE payoff profiles computed in the repeated game from Figure 8d with $\epsilon$ = 0.01 
for different values of the discount factor. Panels, from left to right: $\gamma \in [0.01, 0.4]$, 
$\gamma$ = 0.45, $\gamma$ = 0.5, $\gamma$ = 0.7, $\gamma$ = 0.9. 



$\epsilon$      $l$        Iterations    Time (s) 
0.025    0.008    55            1750 
0.050    0.016    41            770 
0.100    0.031    28            165 
0.200    0.063    19            55 
0.300    0.125    10            19 
0.500    0.250    5             15 

Table 1: The performance of Algorithm 1 in the repeated Battle of the Sexes for different values of the 
approximation factor $\epsilon$. The second column represents the hypercube side length $l$ at the end of the 
algorithm's execution; the third column contains the number of iterations until convergence; the last column 
contains the overall execution time in seconds. 



can only compute pure action strategy and payoff profiles, is incapable of returning any SPE point. On the 
other hand, Algorithm 1 does return a non-empty SPE set for the whole range of values of the discount 
factor (Figure 13). 

Finally, the numbers in Table 1 demonstrate how different values of the approximation factor $\epsilon$ impact 
the performance of Algorithm 1 (with clusters) in terms of (i) number of iterations until convergence and 
(ii) time spent by the algorithm to compute a solution. The game used in this experiment is the repeated 
Battle of the Sexes from Figure 8c. 
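The final side lengths reported in Table 1 can be related to the halving rule used in the termination proof. The sketch below copies the table rows as data and checks that each reported side length is, up to the three-decimal rounding of the table, a negative power of two. The helper name and the rounding tolerance are assumptions made for this illustration only.

```python
import math

# Rows of Table 1: (epsilon, final side length l, iterations, time in seconds).
TABLE_1 = [
    (0.025, 0.008, 55, 1750),
    (0.050, 0.016, 41, 770),
    (0.100, 0.031, 28, 165),
    (0.200, 0.063, 19, 55),
    (0.300, 0.125, 10, 19),
    (0.500, 0.250, 5, 15),
]


def nearest_halving_exponent(side):
    """Return the integer k for which 2**-k is closest (in log scale) to `side`."""
    return round(-math.log2(side))


exponents = [nearest_halving_exponent(side) for _, side, _, _ in TABLE_1]
# Largest gap between a reported side length and the matching power of two.
max_rounding_gap = max(abs(2.0 ** -k - side)
                       for k, (_, side, _, _) in zip(exponents, TABLE_1))
```

The exponents come out as 7, 6, 5, 4, 3 and 2, so each step down in $\epsilon$ in the table costs exactly one additional halving of $l$, while the running time grows much faster than the iteration count.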



7 Discussion 

We have presented an approach for approximately computing the set of subgame-perfect equilibrium (SPE) 
payoff profiles in repeated games and for deriving strategies implementable as finite automata and capable 
of approximately inducing those payoff profiles. To our knowledge, this is the first time that both these goals 
are achieved simultaneously. 

Furthermore, for the setting where no coordination during the game-play is possible, our algorithm returns 
the richest set of SPE payoff profiles among all existing algorithms for repeated games. More precisely, it 
returns a set that contains all stationary Nash equilibrium payoff profiles, all non-stationary pure SPE 
payoff profiles, and a subset of non-stationary mixed SPE payoff profiles. In a case where a certain level of 
coordination can be assumed, such as the availability of a public correlating device, our algorithm returns a 
set containing all SPE payoff profiles, while satisfying the necessary approximation properties. 

In this paper, we adopted the usual assumption that the discount factor, $\gamma$, is the same for all players. 
However, our algorithms can readily be modified to incorporate player specific discount factors. Furthermore, 
for simplicity of presentation, we assumed that the hypercube side length, $l$, is the same for all players. 
This is also not a strict requirement; it is straightforward to generalize all algorithms and theoretical 
results to the case of player specific hypercube side lengths. 

One formulation of the procedure CubeSupported assumes the presence of a source of a commonly 
observed random signal (public correlating device). A natural question is why not aim, in that 
case, at computing a richer set of subgame perfect correlated equilibrium (SPCE) payoff profiles (Aumann, 
1987). Indeed, several algorithms for computing the set of SPCE payoff profiles and the strategies to achieve 
them have recently been proposed (Murray and Gordon, 2007; Dermed and Isbell, 2009). Our algorithm can 
also be transformed into one for approximating the set of SPCE payoff profiles. Indeed, in that case, the 
mathematical programming problem for the CubeSupported procedure will be even simpler than that for 
SPE. This is due to the fact that for computing a correlated equilibrium, one has to find a single probability 
distribution over players' action profiles, and not a profile of probability distributions whose product 
enforces equilibrium. 
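The reason the correlated case is simpler is that the incentive constraints are linear in the joint distribution over action profiles. The sketch below illustrates this on the stage game only, not on the repeated-game SPCE problem, and it is not the CubeSupported procedure. It assumes Battle of the Sexes payoffs with (0, 0) for miscoordination, and the function name is hypothetical.

```python
from fractions import Fraction

ACTIONS = ("O", "F")

# Assumed Battle of the Sexes payoffs: (player 1, player 2) per action profile.
PAYOFFS = {("O", "O"): (1, 2), ("O", "F"): (0, 0),
           ("F", "O"): (0, 0), ("F", "F"): (2, 1)}


def is_correlated_equilibrium(dist):
    """Check the stage-game obedience constraints for a joint distribution.

    dist maps action profiles (a1, a2) to probabilities. Every constraint is
    a linear inequality in the entries of dist.
    """
    for player in (0, 1):
        for recommended in ACTIONS:
            for deviation in ACTIONS:
                gain = Fraction(0)
                for other in ACTIONS:
                    if player == 0:
                        profile, deviated = (recommended, other), (deviation, other)
                    else:
                        profile, deviated = (other, recommended), (other, deviation)
                    gain += dist[profile] * (PAYOFFS[deviated][player]
                                             - PAYOFFS[profile][player])
                if gain > 0:  # deviating from the recommendation pays off
                    return False
    return True


half = Fraction(1, 2)
coordinated = {("O", "O"): half, ("O", "F"): 0, ("F", "O"): 0, ("F", "F"): half}
miscoordinated = {("O", "O"): 0, ("O", "F"): 1, ("F", "O"): 0, ("F", "F"): 0}
coordinated_ok = is_correlated_equilibrium(coordinated)
miscoordinated_ok = is_correlated_equilibrium(miscoordinated)
```

The evenly weighted mixture of (O, O) and (F, F) passes the check and yields the payoff profile (3/2, 3/2), whereas a distribution concentrated on (O, F) fails it. In a Nash equilibrium the joint distribution must additionally factor into a product of individual mixed actions, which is what makes the corresponding constraints non-linear.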

However, in order to implement correlated equilibria in practice, one has to have a reliable third-party 
mediator that can send private signals to the players before every repeated game stage. Furthermore, at 
every period, the signals sent to the players have to be drawn from a specific distribution, which differs 
between repeated game stages. In the presence of communication, the mediator can be replaced by a 
special communication protocol (Dodis et al., 2000). Nevertheless, each stage of the repeated game has to 
be preceded by a round of communication in order to simulate the mediator. 

On the other hand, the SPE computed using Algorithm 1 with the CubeSupported procedure 
given by Algorithms 2 or 3 require neither a mediating party nor communication. Furthermore, the 
assumption of public correlation, adopted in order to implement Algorithm 4, only requires the presence of 
a source of a (constant) uniformly distributed signal that has to be observed by all players only at certain 
repeated game periods. This is a significantly less restrictive assumption than the one that has to be satisfied 
for implementing correlated equilibria in practice. 
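One way to picture how a public, uniformly distributed signal convexifies continuation payoffs is the selection rule sketched below. It is an illustration under stated assumptions, not the procedure of Algorithm 4: the function name, the weights and the two candidate continuation payoff profiles are hypothetical. Every player observes the same signal, applies the same rule, and therefore moves to the same continuation.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def select_continuation(signal, weights):
    """Map a commonly observed signal in [0, 1) to a continuation index.

    Index k is selected when the signal falls in the k-th sub-interval of
    [0, 1), whose length is weights[k]; the weights are assumed to sum to 1.
    """
    cumulative = list(accumulate(weights))
    # The min() guards against a cumulative sum slightly below 1 due to rounding.
    return min(bisect_right(cumulative, signal), len(weights) - 1)


# Hypothetical continuation payoff profiles and convex weights.
continuations = [(1.0, 2.0), (2.0, 1.0)]
weights = [0.25, 0.75]

# Expected continuation payoff profile implied by the weights.
expected = tuple(sum(w * v[i] for w, v in zip(weights, continuations))
                 for i in range(2))

# A single public draw; both players compute the same index from it.
public_signal = random.Random(0).random()
index_seen_by_player_1 = select_continuation(public_signal, weights)
index_seen_by_player_2 = select_continuation(public_signal, weights)
```

Here the expected continuation payoff profile is (1.75, 1.25), a point that neither candidate delivers on its own. No private messages or mediator are involved, only one shared draw.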

Algorithm 1 with the CubeSupported procedure given by Algorithm 2 can be straightforwardly extended 
to stochastic games while preserving the linearity of the mathematical programming problem of the 
CubeSupported procedure. In more general cases, however, the existence of multiple states in the 
environment is a source of non-linearity. The latter property, together with the presence of integer 
variables, requires special techniques to solve the problem; this constitutes a subject for future research. 